
Journal of Software

ISSN 1796-217X

Volume 7, Number 6, June 2012

Contents

Special Issue: Current Research in Software Technologies


Guest Editors: Qihai Zhou and Junjie Wu

Guest Editorial 1177


Qihai Zhou and Junjie Wu

SPECIAL ISSUE PAPERS

Design Patent Image Retrieval Based on Shape and Color Features 1179
Zhiyuan Zeng and Wenli Yang

Image Enhancement Based on Selective-Retinex Fusion Algorithm 1187


Xuebo Jin, Jia Bao, and Jingjing Du

Research on Detectable and Indicative Forward Active Network Congestion Control Algorithm 1195
Jingyang Wang, Min Huang, Haiyao Wang, Liwei Guo, and Wanzhen Zhou

Makespan Minimization on Parallel Batch Processing Machines with Release Times and Job Sizes 1203
Shuguang Li

Ontology and CBR-based Dynamic Enterprise Knowledge Repository Construction 1211


Huiying Gao and Xiuxiu Chen

Convexity Conditions for Parameterized Surfaces 1219


Kui Fang, Lu-Ming Shen, Xiang-Yang Xu, and Jing Song

Research on Automatic Management Model to Personal Computer 1227


Yalin Song and Xin He

Study and Application of an Improved Clustering Algorithm 1234


Lijuan Zhou, Yuyan Chen, and Shuang Li

Dynamic Analysis for Geographical Profiling of Serial Cases Based on Bayesian-Time Series 1242
Guanli Huang and Guanhua Huang

Research on Dependable Distributed Systems for Smart Grid 1250


Qilin Li and Mingtian Zhou

The Application of SPSS Factor Analysis in the Evaluation of Corporate Social Responsibility 1258
Hongming Chen and Xiaocan Xiao

Relationship between Motivation and Behavior of SNS User 1265


Hui Chen

The Load Forecasting Model Based on Bayes-GRNN 1273


Yanmei Li and Jingmin Wang
Research on the Model Consumption Behavior and Social Networks Role of Digital Music 1281
Dan Liu, Tianchi Yang, and Liang Tan

A New Method of Medical Image Retrieval for Computer-Aided Diagnosis 1289


Hui Liu and Guochao Sun

REGULAR PAPERS

A Detailed Study of NHPP Software Reliability Models (Invited Paper) 1296


Richard Lai and Mohit Garg

Confidence Estimation for Graph-based Semi-supervised Learning 1307


Tao Guo and Guiyang Li

Semantically Enhanced Uyghur Information Retrieval Model 1315


Bo Ma, Yating Yang, Xi Zhou, and Junlin Zhou

Formalizing Domain-Specific Metamodeling Language XMML Based on First-order Logic 1321


Tao Jiang and Xin Wang

Framework and Implementation of the Virtual Item Bank System 1329


Wen-Wei Liao and Rong-Guey Ho

An Approach to Automated Runtime Verification for Timed Systems: Applications to Web Services 1338
Tien-Dung Cao, Richard Castanet, Patrick Felix, and Kevin Chiew

Algorithm of Diffraction for Standing Tree based on the Uniform Geometrical Theory of Diffraction 1351
Yun-Jie Xu, Wen-Bin Li, and Shu-Dong Xiu

Biddy - a Multi-platform Academic BDD Package 1358


Robert Meolic

Implementation of Multi-objective Evolutionary Algorithm for Task Scheduling in Heterogeneous Distributed Systems 1367
Yuanlong Chen, Dong Li, and Peijun Ma

Dominance-based Rough Interval-valued Fuzzy Set in Incomplete Fuzzy Information System 1375
Minlun Yan

A Novel PIM System and its Effective Storage Compression Scheme 1385
Liang Huai Yang, Jian Zhou, Jiacheng Wang, and Mong Li Lee

Analyzing Effective Features based on User Intention for Enhanced Map Search 1393
Junki Matsuo, Daisuke Kitayama, Ryong Lee, and Kazutoshi Sumiya

Achieving Dynamic and Distributed Session Management with Chord for Software as a Service Cloud 1403
Zeeshan Pervez, Asad Masood Khattak, Sungyoung Lee, and Young-Koo Lee

A Quick Emergency Response Model for Micro-blog Public Opinion Crisis Based on Text Sentiment Intensity 1413
Mingjun Xin, Hanxiang Wu, and Zhihua Niu

A New Text Clustering Method Based on KSEP 1421


ZhanGang Hao

Special Issue on Current Research in Software Technologies

Guest Editorial
As the core of Computer Science and Information Technology, software technologies have been playing more and more important roles in human life, and their applications have penetrated almost all fields of today's society. As with previous industrial revolutions, the software technologies behind the Computer Science and Information Technology revolution stem from the current major innovations in science and technology, drive tremendous progress in the world's productive forces, and push modern human civilization into a new stage. In turn, the ever-greater demands of human civilization and social progress act as a powerful feedback force for the development of software technologies.
Organized by the International Information Technology and Applications Association (IITAA), the International Forum on Information Technology and Applications (IFITA) and the International Forum on Computer Science-Technology and Applications (IFCSTA) have been held successfully every year since 2009, aiming to provide scholarly interchange platforms for researchers all over the world. Some research results and findings in software technologies and their
applications are selected here:
Design Patent Image Retrieval Based on Shape and Color Features proposed a synthesis design patent image
retrieval method based on shape and color features.
Image Enhancement Based on Selective - Retinex Fusion Algorithm developed the modified Retinex algorithm
based on the S curve firstly.
Research on Detectable and Indicative Forward Active Network Congestion Control Algorithm presented a
Detectable and Indicative Forward Active network Congestion Control algorithm (DIFACC).
Makespan Minimization on Parallel Batch Processing Machines with Release Times and Job Sizes studied
the scheduling problem of minimizing makespan on parallel batch processing machines encountered in different
manufacturing environments.
Ontology and CBR-based Dynamic Enterprise Knowledge Repository Construction focused on an integrated
framework and operating processes of dynamic knowledge repository construction
Convexity Conditions for Parameterized Surfaces presented a criterion for local convexity (concavity) of
parameterized surfaces and also obtained the criterion condition for convex surfaces of binary functions.
Research on Automatic Management Model to Personal Computer studied some interesting problems and key
values of knowledge base for automatic management.
Study and Application of an Improved Clustering Algorithm, combined with the characteristics of early
warning about students' grades, presented an optimized clustering algorithm for supporting and improving students' study.
Dynamic Analysis for Geographical Profiling of Serial Cases Based on Bayesian-Time Series established three
new methodologies based on existing frameworks.
Research on Dependable Distributed Systems for Smart Grid examined the question of
dependability and identified major challenges in the construction of dependable systems.
The Application of SPSS Factor Analysis in the Evaluation of Corporate Social Responsibility designed and
determined a standard evaluation model of corporate social responsibility of thermal power.
Relationship between Motivation and Behavior of SNS User focused on the influence of SNS users'
motivation on their behavior.
The Load Forecasting Model Based on Bayes-GRNN established the index system of the GRNN forecasting model
and then applied Bayes theory for reduction.
Research on the Model Consumption Behavior and Social Networks Role of Digital Music aimed to find out
some interesting and valuable results about the model consumption behavior and social networks role of digital music.
A New Method of Medical Image Retrieval for Computer-Aided Diagnosis proposed a new approach to the
retrieval of medical images based on the traditional Markov Random Field model.
If the readers of this Special Issue find and enjoy something (such as academic ideas, methods and
inspiration) from the papers in it, our goal is realized.

Guest Editors:

Qihai Zhou
President, International Information Technology and Applications Association, IITAA, Hong Kong
Full Professor (Grade 2, in China), Southwestern University of Finance and Economics, China
General Chair, International Forum on Information Technology and Applications, IFITA
General Chair, International Forum on Computer Science-Technology and Applications, IFCSTA
General Chair, International Forum on Communication Technology and Applications, IFCTA

Junjie Wu
Assistant Professor, Ph.D, National University of Defense Technology, China.

Qihai Zhou (1947-) is a Full Professor (since 1995), doctoral (and master's) supervisor and head of the
Information Technology Application Research Institute, School of Economic Information Engineering,
Southwestern University of Finance and Economics (SWUFE), China. He graduated in 1982 from Lanzhou
University, China, and has been working at SWUFE since 1982, successively holding the posts of teaching
assistant (1982-1987), lecturer (1987-1991), associate professor (1991-1995, promoted exceptionally in 1991)
and professor (1995-today, promoted exceptionally in 1995). He received the titles of Outstanding Expert with
Outstanding Contributions of Sichuan Province, China (enjoying government subsidies; summa cum laude of the
Sichuan province government, 1996) and One Hundred Academic and Managerial Leading Heads of China
Informationalization (summa cum laude in this domain in China, 2006). He has published 46 academic books and
over 212 academic papers, and is President of IITAA (International Information Technology & Applications
Association) and Chair or Organizing Chair of several important international conferences. His research interests include
algorithm research, computational geometry, isomorphic information processing, economics & management computation, eBusiness,
and so on. More about Prof. Qihai Zhou can be found at http://www.iitaa.com/member-ZhouQiHai.doc

Junjie Wu (1981-) is an assistant professor, Ph.D., working in the National Laboratory for Parallel and
Distributed Processing, National University of Defense Technology, China. He is a member of IEEE, ACM and
CCF. He was born in Anhui, China, on Nov. 7th, 1981, and received his Ph.D. degree in computer science and
technology from the National University of Defense Technology, China. He has published many SCI- and EI-indexed
papers in leading journals, such as Journal of Computer Science and Technology, Chinese Journal of
Computers, Journal of Software, and Journal of Computer Research and Development. He received the best paper
award of HPC China 2009. Currently, his main research interests include parallel computing, computer
architecture and compilation technology.

Design Patent Image Retrieval Based on Shape and Color Features

Zhiyuan Zeng
Digital Engineering and Simulation Research Center, Huazhong University of Science and Technology
Wuhan 430074, China
Email: waiwaiwang123@gmail.com

Wenli Yang
Digital Engineering and Simulation Research Center, Huazhong University of Science and Technology
Wuhan 430074, China
Email: 154653508@qq.com

Abstract—Based on the concrete characteristics and application background of design patent, and according to the deficiency of the present retrieval system of the design patent database, this paper proposes a synthesis design patent image retrieval method based on shape and color features. First, we present a shape retrieval approach based on the morphological description moment invariants, which not only reflects the morphological characteristics of the design patent image, but also has translation, rotation and scale invariability. Second, an improved color retrieval method based on the key local fuzzy color histogram, which contains the color statistics and the local distribution characteristics, is used to improve the color retrieval accuracy. At last, a weight coefficient is set with consideration of the similarity degrees of the shape and color features. The experimental results indicate that the proposed method can retrieve similar design patent images fast and efficiently, and improve the recall ratio and precision of design patent image retrieval.

Index Terms—Design patent image retrieval, similarity degree, morphological description moment invariants, key local fuzzy color histogram, SAPSO

I. INTRODUCTION

In recent years, with the development of the international situation and the accelerating process of integrated world trade, intellectual property protection has been paid more and more attention all over the world. Due to the increased number of patent applications, which results in more and more difficulties in manually comparing their similarity degree, it is very necessary to establish an accurate and highly effective design patent image retrieval system, especially for reducing imitation disputes, providing help for patent infringement judgment and realizing genuine modern management.
The design patent image retrieval belongs to the technology category of content-based image retrieval. At present, there are many new technical methods since content-based image retrieval (CBIR) [1]-[5] was brought forward. Researchers focus on feature extraction and similarity measurement, while ignoring concrete applications. Aiming at the concrete characteristics and application background of design patents, design patent image retrieval should emphasize shape and color features [4]-[6].
In this paper, a synthesis retrieval method for patent images using shape and color features is proposed. After the pre-processing of patent images, we extract the shape features of patent images with the morphological description moment invariants, which can reflect the overall and accurate morphological characteristics of a patent image compared with the wavelet modulus maxima and the contour description matrix. Then an improved color retrieval method based on the key local fuzzy color histogram is used to improve the color retrieval speed and accuracy, and a study comparing the accumulation color histogram, the traditional fuzzy color histogram and the presented method is also given. The experimental results show that the presented algorithm can effectively realize design patent image retrieval and recognition, and improve the recall ratio and precision of the similarity evaluation of design patent images.

II. METHODOLOGY

A. Shape retrieval based on the morphological description moment invariants

In the design of an automatic retrieval system based on image content, the shape feature is an important means to describe high-level visual features. At present, shape feature extraction methods can be divided into two types: shape feature extraction based on edge detection and shape feature extraction based on region segmentation. But most of these methods depend on correct extraction of the regional boundary or complete segmentation of the image; therefore, it is unavoidable to encounter difficulties with borderline-blurred images and incompletely segmented images. This paper proposes a new method based on morphological description moment invariants, which can not only reflect the more overall and accurate morphological characteristics of the design patent image and is not affected by borderline-blurred images and incomplete segmentation, but also has translation, rotation and scale invariability [7]-[9].

1) Image Preprocessing
The image preprocessing is mainly to detect the image edge and get a binary image. The scale of the filters is a difficult problem for edge detection: the bigger the scale is, the better the effect of edge detection is, but the more inaccurate the edge location is. The Canny operator is presented to detect image edges in multi-scale space. In this paper, the Canny operator [7] is used to extract the image edge to overcome the contradiction between edge location and noise effect.

Figure 1. The morphological description moment invariants of design patent images.
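As a concrete illustration of this preprocessing step, the sketch below uses OpenCV's Canny operator to obtain the binary edge map; the threshold values and the light Gaussian smoothing are assumptions for illustration rather than parameters given in the paper.

```python
import cv2
import numpy as np

def preprocess_edges(image_path, low_thresh=50, high_thresh=150):
    """Read a patent image, detect edges with the Canny operator,
    and return a binary edge map (values 0/1)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Light smoothing reduces noise before edge detection (assumed step).
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(gray, low_thresh, high_thresh)
    return (edges > 0).astype(np.uint8)
```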
2) Morphological description moment invariants
The edge information of a patent image can be considered as a set of feature points, and the morphological center of the image is the center point composed by the image edge points. If the set of all edge points is $I = \{(x, y) \mid (x, y) \in I\}$, then the center point is set as $I_o\{x_o, y_o\}$.
Considering the computational complexity and rotation-scale invariability, this paper proposes an improved representation of the radial distance and angular orientation between an edge point and the center point:

$$r = \frac{|x - x_o| + |y - y_o|}{|x_{\max} - x_o| + |y_{\max} - y_o|} \qquad (1)$$

$$\theta = \arctan\left(\frac{y - y_o}{x - x_o}\right) + \arctan\left(\frac{y_{\max} - y_o}{x_{\max} - x_o}\right) \qquad (2)$$

where the farthest point from the center point is expressed as $I_{\max}\{x_{\max}, y_{\max}\}$ and its distance to the center point is

$$r_{\max} = \sqrt{(x_{\max} - x_o)^2 + (y_{\max} - y_o)^2} \qquad (3)$$

The radial distance is divided into $m$ equal parts, and the angular orientation is divided into $n$ equal parts. The intervals of the radial direction and the angular orientation are as follows:

$$r_d = r_{\max} / m \qquad (4)$$

$$\theta_d = 2\pi / n \qquad (5)$$

Let $S(r_i, \theta_j)$, the total number of all points in the grid cell $N(r_i \sim r_i + r_d,\ \theta_j \sim \theta_j + \theta_d)$, be

$$S(r_i, \theta_j) = \left|\{(r, \theta) \mid r \in [r_i, r_i + r_d],\ \theta \in [\theta_j, \theta_j + \theta_d]\}\right| \qquad (6)$$

Then the element of the matrix $S$ is

$$S_{ij} = S(r_i, \theta_j) / |I| \qquad (7)$$

$S_{ij}$ expresses the percentage of the total points in the cell, and this matrix $S$ is called the morphological description moment invariants. Figure 1 shows the morphological description moment invariants of design patent images.

3) Similarity degree of morphological description moment invariants
Suppose $S^1_{ij}$ and $S^2_{ij}$ are two different morphological description moment invariants; then $u_{ij} = S^1_{ij} - S^2_{ij}$, where $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. We can see that $-1 \le u_{ij} \le 1$. By adding the elements in each column of a row $u_i$, we get the vector $u_R$:

$$u_R = \{u_i \mid u_i = \sum_{j=1}^{n} u_{ij}^2\}, \quad i = 1, 2, \ldots, m \qquad (8)$$

$u_R$ represents the ratio of all spots which have the same radial distance but different angular orientations ($0 \sim 2\pi$). Using a similar method, we can also get $u_\theta$:

$$u_\theta = \{u_j \mid u_j = \sum_{i=1}^{m} u_{ij}^2\}, \quad j = 1, 2, \ldots, n \qquad (9)$$

The similarity degrees of the radial distance $\lambda_R$ and the angular orientation $\lambda_\theta$ are introduced to give a more explicit expression of the distribution information of radial distance and angular orientation:

$$\lambda_R = \sum_{i=1}^{m} u_{iR} \qquad (10)$$

$$\lambda_\theta = \sum_{j=1}^{n} u_{j\theta} \qquad (11)$$

Then, the comprehensive judgment of the similarity degree of two different morphological description moment invariants is defined as

$$\lambda = \alpha \lambda_R + (1 - \alpha) \lambda_\theta, \quad \alpha \in [0, 1] \qquad (12)$$

4) Parameter adjustment of retrieval precision
Different values of the parameters $m$, $n$ and $\alpha$ will cause different retrieval results. In order to retrieve design patent images more effectively and accurately, this paper uses the simulated annealing particle swarm optimization algorithm (SAPSO) to search for the optimum parameters. We set $(m, n) = (24, 12)$.

The basic idea of the simulated annealing algorithm (SA) [10] is to find the global optimal solution, or a solution close to it, by simulating the annealing process of a high-temperature object.
The particle swarm optimization algorithm (PSO) [11-13] is based on a group, and each individual of the group is moved to a better region according to its environment fitness, where each individual is regarded as a particle in D-dimensional space. The particle swarm is represented as $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iD}\}$, where $x_i$ is the current position of particle $i$; $v_i = \{v_{i1}, v_{i2}, \ldots, v_{iD}\}$ is the current speed of particle $i$; $P_i = \{P_{i1}, P_{i2}, \ldots, P_{iD}\}$ is the optimum position of particle $i$ so far, that is, $P_{best_i}$; and $P_g = \{P_{g1}, P_{g2}, \ldots, P_{gD}\}$ is the optimum position of the whole particle swarm so far, that is, $g_{best}$. For each generation, the speed and position of particle $i$ in D-dimensional space are adjusted by the following formulas:

$$v_{id} = \omega v_{id} + c_1 r_1 (P_{id} - x_{id}) + c_2 r_2 (P_{gd} - x_{id}) \qquad (13)$$

$$x_{id} = x_{id} + v_{id} \qquad (14)$$

where $\omega$ is the weight coefficient, $c_1$ and $c_2$ are the acceleration constants, and $r_1$ and $r_2$ are random numbers whose variation range is [0, 1].

Figure 2. The flow chart of the SAPSO searching algorithm.

By experiment, we can see that the best retrieval result over 200 randomly selected design patent images from the sample population is achieved when m = 24, n = 12 and α = 0.63894.

5) Experimental analysis
Simulation experiments are made in VC++ 6.0, and the algorithm performance has been tested and analyzed in four aspects:
(1) The translation, rotation and scale invariability: we produce some simple geometric figures (including triangle, square, rectangle and circle) to verify the translation, rotation and scale invariability of the morphological description moment invariants. Table 1, Table 2 and Table 3 give respectively the statistical data of the verification test for the contour description matrix, the wavelet modulus maxima and the presented shape retrieval algorithm, where the total number of images in the design patent database is 1000, including 15 images of scale variation, 45 images of rotation variation, and 10 images of translation variation. P represents the ratio between the number of images returned by retrieval and the total number of the design patent database. Figure 3 shows the verification results of translation, rotation and scale invariability, where Figure 3(a) gives the retrieval results of the triangle and Figure 3(b) the retrieval results of the square. Based on the experimental results and analysis, the shape retrieval based on the morphological description moment invariants can be applied whether or not translation, rotation and scale variances exist, and all the examples give the desired results.
(2) The shape retrieval of design patent images: we select 66 patent images from the design patent database. The experimental results are as follows (Figure 4 is the case diagram of the design patent; Figure 5 shows the shape retrieval results of the design patent database, and the order of decreasing similarity is from left to right and top to bottom).
(3) In this paper, we used the SAPSO algorithm to get the optimal weight coefficient α, which converges to the optimum solution rapidly. Figure 6 shows the comparison results of the optimal solution for SAPSO, PSO and SA.
(4) The experimental comparison shows that the new shape retrieval method based on the morphological description moment invariants can significantly improve the recall ratio and precision ratio of design patent retrieval compared with the contour description matrix and the wavelet modulus maxima. Figure 7 and Figure 8 give the comparison results of the recall ratio and precision ratio of the contour description matrix, the wavelet modulus maxima and the presented shape retrieval algorithm.
The results achieve the expected effect, but we should see the inherent defect: the descriptor cannot include color and texture information. Therefore, images with similar contours but different contents were often retrieved. If the color information is considered, we can get a much better result.
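A minimal sketch of the PSO update in equations (13)-(14) is given below; the fitness function, the cooling of the inertia weight, and all numeric settings are illustrative assumptions, since the paper only specifies the update rule and that SAPSO couples PSO with simulated annealing.

```python
import numpy as np

def pso_search(fitness, dim, n_particles=20, iters=100,
               w=0.9, c1=2.0, c2=2.0, bounds=(0.0, 1.0)):
    """Plain PSO loop using the velocity/position updates of eqs. (13)-(14).
    `fitness` is minimized; lower values are better."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))                     # velocities
    pbest = x.copy()
    pbest_val = np.array([fitness(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # eq. (13)
        x = np.clip(x + v, lo, hi)                                  # eq. (14)
        vals = np.array([fitness(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
        w *= 0.99   # slowly cooled inertia weight (simulated-annealing flavour, assumed)
    return gbest
```

For the parameter search in this section, the fitness function would score a candidate (m, n, α) on a validation set of patent images; that evaluation is not shown here.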

TABLE I. THE STATISTICAL DATA OF SHAPE RETRIEVAL BASED ON THE CONTOUR DESCRIPTION MATRIX (S: SCALE, R: ROTATION, T: TRANSLATION)

        P=5%                 P=10%                P=15%
        images #  percent    images #  percent    images #  percent
   S    2         10%        5         33.3%      12        80%
   R    12        26.7%      22        48.9%      31        68.9%
   T    10        100%       10        100%       10        100%

TABLE II. THE STATISTICAL DATA OF SHAPE RETRIEVAL BASED ON THE WAVELET MODULUS MAXIMA (S: SCALE, R: ROTATION, T: TRANSLATION)

        P=5%                 P=10%                P=15%
        images #  percent    images #  percent    images #  percent
   S    3         10%        6         33.3%      13        86.7%
   R    14        31.1%      31        68.9%      38        84.4%
   T    10        100%       10        100%       10        100%

TABLE III. THE STATISTICAL DATA OF SHAPE RETRIEVAL BASED ON THE MORPHOLOGICAL DESCRIPTION MOMENT INVARIANTS (S: SCALE, R: ROTATION, T: TRANSLATION)

        P=5%                 P=10%                P=15%
        images #  percent    images #  percent    images #  percent
   S    5         33.3%      9         60%        15        100%
   R    19        42.2%      37        82.2%      43        95.6%
   T    10        100%       10        100%       10        100%

Figure 3. The verification results of translation, rotation and scale invariability.

Figure 4. The case diagram of design patent.

Figure 5. The shape retrieval results of the design patent database.

Figure 6. The comparison results of optimal solution for SAPSO, PSO and SA.

Figure 7. The comparison results of recall ratio of the contour description matrix, wavelet modulus maxima and the presented shape retrieval algorithm.

Figure 8. The comparison results of precision ratio of the contour description matrix, wavelet modulus maxima and the presented shape retrieval algorithm.

B. Color retrieval based on the key local fuzzy color histogram
Color transformation and quantization is an important problem of color retrieval. Comparing various color spaces (including RGB, OPP, YIQ, YUV, YCrCb, CIE, Munsell and HSV) [7],[14], only the quantized HSV color space satisfies the required attributes and matches the visual characteristics better. Therefore, this paper adopts the HSV color space. We first transform the RGB spatial pixel points $(R, G, B) \in [0,1]$ to HSV spatial pixel points $(H, S, V) \in [0,1]$.

1) Key local fuzzy color histogram
Considering the global and local color features, this paper proposes a new color retrieval method based on the key local fuzzy color histogram to improve the retrieval precision. The effect and robustness of the fuzzy color histogram are superior to the general histogram [15]-[17]. Taking hue as a case study, the fuzzy color histogram of a certain location P in the image is given as follows:

$$f_h(P) = (f_{h1}, f_{h2}, \ldots, f_{hn}) \qquad (15)$$

where $f_{hi} = \frac{1}{N}\sum_{j=1}^{N} u_{hij}$, $N$ represents the number of pixels of the certain location P, and $u_{hij}$ represents the membership function of pixel $j$ to pixel $i$:

$$u_{hij} = \begin{cases} 1, & C_{hj} \approx C_{hi} \\ 0, & \text{other} \end{cases} \qquad (16)$$

where $C_{hi}$ and $C_{hj}$ denote respectively the hue values of index $i$ and index $j$.
The key hue is based on the statistical occurrence frequency of the hue itself: when $h = h_i$, whether H is a key hue depends on the value of $f_{hi} - f_{h(i-1)}$. Determining the $k_h$ hues of the highest occurrence frequency can simplify the fuzzy color histogram. Then the hue eigenvector of the certain location P is:

$$f_h(P) = [f_{h1}, f_{h2}, \ldots, f_{hk_h}] \qquad (17)$$

2) Similarity degree of key local fuzzy color histogram
Suppose $P_i$ and $T_i$ are the same corresponding locations of images P and T, which are divided into m parts. The key local fuzzy color histogram of $P_i$ is:

$$f_h^{P_i} = [f_{h1}, f_{h2}, \ldots, f_{hk_h}], \quad f_s^{P_i} = [f_{s1}, f_{s2}, \ldots, f_{sk_s}], \quad f_v^{P_i} = [f_{v1}, f_{v2}, \ldots, f_{vk_v}] \qquad (18)$$

where $k_h$, $k_s$ and $k_v$ denote respectively the number of key local colors of H, S and V.
Using the same method, we can also get the corresponding features $f_H^{T_i}$, $f_S^{T_i}$, $f_V^{T_i}$ of $H_{T_i}$, $S_{T_i}$, $V_{T_i}$ in the accumulative histogram of $T_i$. Set:

$$u_{iH} = \sum \left(f_h^{P_i} - f_h^{T_i}\right)^2, \quad u_{iS} = \sum \left(f_s^{P_i} - f_s^{T_i}\right)^2, \quad u_{iV} = \sum \left(f_v^{P_i} - f_v^{T_i}\right)^2 \qquad (19)$$

$u_{iH}$, $u_{iS}$ and $u_{iV}$ denote respectively the similarity degrees of hue, saturation and value.
Then, the integrated similarity $\lambda_i$ of the key colors of the area is:

$$\lambda_i = \omega_h u_{iH} + \omega_s u_{iS} + \omega_v u_{iV} \qquad (20)$$

where $\omega_h$, $\omega_s$, $\omega_v$ are the weight coefficients. In this paper, we set $\omega_h = 0.45$, $\omega_s = 0.40$, $\omega_v = 0.15$.
Then, the overall similarity degree of the key local fuzzy color histograms of the certain parts is:

$$\lambda = \frac{1}{m}\sum_{i=1}^{m} \lambda_i \qquad (21)$$
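The sketch below is one way to realize equations (15)-(21) with NumPy. It assumes the image has already been converted to HSV with channels scaled to [0, 1], uses simple fixed-width binning as the membership of equation (16), and treats each of the m parts as a block of rows; these choices are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np

def key_histogram(channel, bins=32, k=10):
    """Keep only the k most frequent bins of a normalized histogram (eqs. 15-17)."""
    hist, _ = np.histogram(channel, bins=bins, range=(0.0, 1.0))
    hist = hist / max(channel.size, 1)
    key = np.zeros_like(hist, dtype=float)
    top = np.argsort(hist)[-k:]           # indices of the k key colors
    key[top] = hist[top]
    return key

def color_similarity(hsv_p, hsv_t, m=4, kh=10, ks=6, kv=6,
                     weights=(0.45, 0.40, 0.15)):
    """Overall similarity of two HSV images following eqs. (18)-(21)."""
    blocks = np.array_split(np.arange(hsv_p.shape[0]), m)   # split rows into m parts
    lam = []
    for rows in blocks:
        bp, bt = hsv_p[rows], hsv_t[rows]
        u = []
        for c, k in zip(range(3), (kh, ks, kv)):
            fp = key_histogram(bp[..., c].ravel(), k=k)
            ft = key_histogram(bt[..., c].ravel(), k=k)
            u.append(((fp - ft) ** 2).sum())                  # eq. (19)
        lam.append(sum(w * ui for w, ui in zip(weights, u)))  # eq. (20)
    return sum(lam) / m                                       # eq. (21)
```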

3) Experimental analysis
(1) Figure 4 is the case diagram of the design patent; Figure 9 shows the color retrieval results of the design patent database, and the order of decreasing similarity is from left to right and top to bottom, where we set m=4, kh=10, ks=6, kv=6. Table 4 lists the comparison results of the similarity between the case diagram and the images to be retrieved.
(2) The experimental comparison shows that the improved color retrieval method based on the key local fuzzy color histogram can improve the recall ratio and precision ratio of design patent retrieval compared with the traditional fuzzy color histogram and the accumulation color histogram. Figure 10 and Figure 11 give the comparison results of the recall ratio and precision ratio of the traditional fuzzy color histogram, the accumulation color histogram and the presented color retrieval algorithm.

Figure 10. The comparison results of recall ratio of the traditional fuzzy color histogram, accumulation color histogram and the presented color retrieval algorithm.
(3) The retrieval speed of the three color retrieval methods is also compared. Figure 12 shows the comparison result of the retrieval speed of the traditional fuzzy color histogram, the accumulation color histogram and the presented color retrieval algorithm.
We can see that some images' colors are mainly brown to human eyes, but the distributions of the colors are different. The experimental results indicate that this method accords with human visual characteristics, and the retrieval speed is higher because of the use of the key local color features.

Figure 11. The comparison results of precision ratio of the traditional fuzzy color histogram, accumulation color histogram and the presented color retrieval algorithm.

Figure 9. The color retrieval results of the design patent database.

Figure 12. The comparison result of retrieval speed between the traditional fuzzy color histogram, accumulation color histogram and the presented color retrieval algorithm.

C. Design patent image retrieval based on synthesis features
According to the Patent Law of China [18], to judge whether two design patent images are the same or similar, we should consider both shape features and color features; that is, if the patent application products have both the same shape and the same color, it belongs to an infringement act, whereas if the patent application products have the same color but different shapes, it does not constitute an infringement. Thus, according to the rules of the Patent Law mentioned above, we adopt a synthesis retrieval method based on shape and color features, and this paper sets a weight coefficient for the similarity degree considering shape and color features. The weighted similarity is defined as below:

$$\lambda_{last} = \alpha \lambda_1 + (1 - \alpha) \lambda_2 \qquad (23)$$

where $\lambda_1$ represents the similarity degree of the shape features and $\lambda_2$ represents the similarity degree of the color features. In this paper, we also use the SAPSO algorithm to search for the optimum parameter $\alpha$, which is determined by experiment as $\alpha = 0.6657$.
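A short sketch of this weighted combination is given below; the function name and the convention that both inputs are already normalized similarity degrees are assumptions for illustration.

```python
def synthesis_similarity(shape_sim, color_sim, alpha=0.6657):
    """Combine shape and color similarity degrees as in eq. (23)."""
    return alpha * shape_sim + (1 - alpha) * color_sim

# Example: fuse the two similarity degrees of a query/candidate pair.
overall = synthesis_similarity(shape_sim=0.62, color_sim=0.71)
```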
Figure 13 shows the synthesis retrieval result. Figure 14 and Figure 15 show the comparison results of the recall ratio and precision of shape retrieval, color retrieval and synthesis retrieval.
The experimental comparison finds that the synthesis retrieval method based on the presented shape and color algorithms can significantly improve the recall ratio and precision of design patent retrieval.

Figure 13. The synthesis retrieval result.

Figure 14. The comparison results of recall ratio of shape retrieval, color retrieval and synthesis retrieval.

Figure 15. The comparison results of precision ratio of shape retrieval, color retrieval and synthesis retrieval.

TABLE IV. THE SIMILARITY DEGREE OF KEY LOCAL FUZZY COLOR HISTOGRAM

               One part   Two part   Three part  Four part   Overall
               similar    similar    similar     similar     similar
               degree     degree     degree      degree      degree
Case Image 1   0.6518     0.6517     0.6516      0.6513      0.6516
Case Image 2   0.3218     0.3217     0.3212      0.3209      0.3214
Case Image 3   0.6828     0.6821     0.6823      0.6824      0.6824
Case Image 4   0.7720     0.7721     0.7717      0.7718      0.7719
Case Image 5   0.5627     0.5624     0.5619      0.5622      0.5623
Case Image 6   0.9211     0.9206     0.9208      0.9207      0.9208

III. CONCLUSIONS AND PROSPECT

According to the basic characteristics of design patent images and the deficiency of existing retrieval methods, this paper presents a new synthesis retrieval method based on the morphological description moment invariants and the key local fuzzy color histogram. We analyze the single shape retrieval, the single color retrieval and the synthesis retrieval combined with shape and color features, and then draw the experimental conclusions as follows:
(1) The shape retrieval method based on the morphological description moment invariants can not only effectively express the shape feature of a design patent image, but also has translation, rotation and scale

invariability, as shown by the comparison with the contour description matrix and the wavelet modulus maxima.
(2) The color retrieval method based on the key local fuzzy color histogram gives consideration to both the global color feature and the local color distribution characteristic. This method is more robust and efficient than the traditional local fuzzy color histogram method and the accumulation color histogram method.
(3) The synthesis retrieval method combined with shape and color features can significantly improve the retrieval performance of the design patent retrieval system. However, the efficiency of the selection of the weight coefficient and of the retrieval needs further improvement.

ACKNOWLEDGMENT

The authors' heartfelt thanks are due to the task group members for their assistance with the experiments and to our friends and fellow classmates for valuable discussions and help, especially to Juan Zhao and Bin Xu for the very useful discussions on the improvement of the algorithm performance. Without their patient guidance and illuminating instruction, this paper could not have reached its present form. At last, the deepest gratitude is given to the anonymous reviewers and editors for their constructive comments.

REFERENCES

[1] Swain M J, Ballard D H. Color Indexing [J]. International Journal of Computer Vision, 1991, 7(1): 11-32.
[2] W. Niblack, et al. The QBIC project: querying images by content using color, texture and shape [J]. SPIE, 1993, 1908: 173-187.
[3] C. Faloutsos, W. Equitz, M. Flickner, et al. Efficient and Effective Querying by Image Content [J]. Journal of Intelligent Information Systems, 1994, 3: 231-262.
[4] J. Smith, S. F. Chang. A Fully Automated Content-Based Image Query System [C]. Proc. of the 4th ACM Multimedia Conference, Boston, 1996: 87-98.
[5] John M. Zachary Jr., Sitharama S. Iyengar. Content-based Image Retrieval Systems [C]. Proceedings of the 1999 IEEE Symposium.
[6] Qingyun Dai, Haipen Li. Studies on the Retrieval of Design Patent Images [J]. Computer Engineering and Application, 2002, 3(1): 27-29.
[7] Rafael C. Gonzalez, Richard E. Woods, Steven L. Eddins. Digital Image Processing Using MATLAB [M]. Publishing House of Electronics Industry, 2005.
[8] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts [J]. IEEE Trans. Pattern Analysis and Machine Intelligence, 2002, 24(4): 509-522.
[9] Zhiyuan Zeng, Juan Zhao, Bin Xu. An Outward-appearance Patent-image Retrieval Approach Based on the Contour-description Matrix [C]. Japan-China Joint Workshop on Frontier of Computer Science and Technology, 2007: 86-89.
[10] PANG Feng. The Principle of SA Algorithm and the Algorithm's Application on Optimization Problems. Jilin University, 2006: 6-23.
[11] Madar J, Abonyi J, Szeifert F. Interactive particle swarm optimization [C]. Proceedings of the 2005 5th International Conference on Intelligent Systems Design and Applications, 2005.
[12] Lili Li, Xiyu Liu, Bo Zhuang. Fuzzy C-means algorithm based on simulated annealing Particle Swarm Optimization [C]. Computer Engineering and Applications, 2008, 44(30): 170-172.
[13] Gang Zou, Jixiang Sun. Application of Clustering Method Based on Particle Swarm Optimization in Image Segmentation [C]. Electronics Optics & Control, 2009, 16(2): 15-17.
[14] Yixin Chen, Z. James, Robert Krovetz. Content-Based Image Retrieval by Clustering [C]. International Multimedia Conference: Proceedings of the 5th ACM SIGMM International Workshop, 2003: 193-200.
[15] Stricker M, Orengo M. Similarity of color images [C]. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases, 1995, vol. 2420, San Jose, CA, Feb.: 381-392.
[16] Vertan C, Boujemaa N. Using Fuzzy Histograms and Distances for Color Image Retrieval. Challenge of Image Retrieval, 2000.
[17] Jie Yuan, Xinzhong Zhu, Huiying Xu. An Improved Color-Based Image Retrieval Method [J]. Computer System & Applications, 2009, 1(2): 37-41.
[18] Chinese Legal System. The Patent Law of China [M]. 2004.

Zhiyuan Zeng is a professor at the Digital Engineering and Simulation Research Center, Huazhong University of Science and Technology. His research areas are image processing, computer pattern recognition, databases and computer networks. He has published many papers in these research areas, and he has also carried out a lot of work on national key scientific research projects.

Wenli Yang is a PhD student at the Digital Engineering and Simulation Research Center, Huazhong University of Science and Technology. Her research areas are image processing and computer pattern recognition. She has published 6 papers during her PhD research period, and she also spent one year studying in the Department of Computer Science, University of California, Davis, CA.

Image Enhancement Based on Selective-Retinex Fusion Algorithm
Xuebo Jin
College of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China
Email: xuebojin@gmail.com

Jia Bao and Jingjing Du


College of Informatics and Electronics, Zhejiang Sci-Tech University, Hangzhou, China
Email: baojia@zstu.edu.cn, jingjdu@163.com

Abstract—The brightness adjustment method for night-vision image enhancement is considered in this paper. The color RGB night-vision image is transformed into an uncorrelated color space, the YUV space. According to the characteristics of the night-vision image, we first develop a modified Retinex algorithm based on the S curve, by which the luminance component is enhanced and the brightness of the night-vision image is effectively improved. Then the luminance component of the source image is enhanced by a selective and nonlinear gray mapping to retain the essential sunlight and shade information. Based on the two enhancement images, the night-vision image with enough brightness and the necessary sunlight and shade information is combined by a weighted parameter. According to the experimental results, the night-vision image obtained is very fit for visual observation.

Index Terms—image enhancement, night-vision image, Retinex, the S curve, selective and nonlinear gray mapping

I. INTRODUCTION

Statistics show that traffic accidents have been increasing during the night over the past decade. In recent years, night auxiliary driving systems, which are based on image processing technology, have become a hot spot of driving information technologies.
With street lights, night-vision images have the following characteristics: (1) a great lack of color information; (2) brightness around the illumination sources, such as street lamps and car lights, while the whole illumination is inadequate; (3) very dark areas where there is no light. So the night-vision image needs a reduction of the light contrast between the lighting and shade parts and an increase of the whole luminance to improve the visibility of night-vision images.
To get a colorful night-vision image, the source IR and visible images are fused by color transfer technology to obtain a colorful final image [1-3], which is the so-called pseudo-color technology. The IR image records thermal radiation emitted by objects in a scene, which can capture heat-source material without light, and the visible image has much more high-frequency information of the background, which has rich color information. In this method, appropriate reference visible images must be selected to get a suitable colorful image. The basic rule of those methods is to map the IR image and the visible image into the three components of the RGB image by certain rules to obtain a false color image, and to transfer the mean and standard deviation of a selected natural day-time color image to the false color image for each channel in an appropriate color space. However, the obtained color cannot be similar to the true scene. For auxiliary driving systems it is very important to get the true scene while driving: the more similar the image is to the true scene, the more reasonable the information delivered to the drivers, and the safer the driver is while driving.
Several techniques have been proposed to enhance the brightness of the night-vision image. One method is histogram equalization (HE) [5-6]. The traditional HE strategy usually produces annoying artifacts and overstated contrast that make the image unnatural. It will ignore a lot of detailed information and bring in noise, so it does not fit the image with great contrast and visual observation.
Another brightness adjustment method is the Retinex algorithm [7-9]. The idea of the Retinex is to decompose the source image into two different images, i.e., the luminance component image and the reflection component image. The reflection component image is the final enhanced image.
This paper proposes a brightness adjustment method for the night-vision image based on the modified Retinex algorithm. Firstly, the night-vision image is enhanced by the modified Retinex algorithm of the S curve; then the selective and nonlinear gray mapping is applied to improve the sunlight and shade information of the night-vision image.

II. THE S-RETINEX METHOD FOR NIGHT-VISION IMAGES

A. Retinex algorithm
The flow of the Retinex algorithm is shown in Fig. 1. The source image is decomposed into two different images, i.e., the luminance component image and the reflection component image.

This method can eliminate the foreground and background luminance influence of the image, and can handle a high-dynamic-range indoor/outdoor scene.

Figure 1. The flow of the Retinex algorithm.

For an image $F(i, j)$, the formula is given by:

$$F(i, j) = R(i, j) \cdot I(i, j) \qquad (2)$$

where $F(i, j)$ is obtained by observation or sensor reception, $R(i, j)$ denotes the reflection component image, and $I(i, j)$ stands for the luminance component image. In fact, the reflection component image $R(i, j)$ determines the essential property of the image, and the luminance component image $I(i, j)$ determines the maximum dynamic range of the image directly. So it is the essence of the Retinex theory to obtain the original property by removing the luminance component image $I(i, j)$ from the image $F(i, j)$.
A two-dimensional Gaussian convolution function $G(i, j)$ can estimate the luminance component image $I(i, j)$ from the known image $F(i, j)$. It is given by:

$$G(i, j) = \frac{1}{2\pi\sigma^2} e^{-\frac{i^2 + j^2}{2\sigma^2}} \qquad (3)$$

where $\sigma$ is the standard deviation of the Gaussian function. The image enhancement effect is affected by the standard deviation directly, which controls how many fine details are left. The choice of $\sigma$ should satisfy the condition below:

$$\iint G(i, j)\, di\, dj = 1 \qquad (4)$$

The Retinex output, the reflection component image $R(i, j)$, is given by:

$$\log R(i, j) = \log F(i, j) - \log\left[G(i, j) * F(i, j)\right] \qquad (5)$$

where $F(i, j)$ is the Retinex input image, $I(i, j)$ is the luminance component image, $G(i, j)$ is the two-dimensional Gaussian function, $*$ stands for the convolution operation, and $(i, j)$ denotes the coordinates of the pixels.
Enhancing the luminance is the key for the night-vision image, so the YUV color space is chosen in this paper. There are some other color spaces, i.e., RGB, HSV, YCaCb, YUV, etc. In the RGB color space, the three spatial components are correlated. Although in the HSV and YCaCb color spaces the components are not correlated, the transformations from the RGB color space to HSV and YCaCb are both nonlinear. In the YUV color space, the three spatial components are uncorrelated, and the transformation from YUV to the RGB color space is linear; moreover, it is the main acquisition color space of color cameras.
Because the image acquired by the camera is in the RGB color space, the RGB night-vision image first needs to be transformed into the YUV space. The transform from RGB space to YUV space is written as:

$$\begin{bmatrix} Y(i,j) \\ U(i,j) \\ V(i,j) \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.1471 & -0.2888 & 0.4359 \\ 0.6148 & -0.5148 & -0.1000 \end{bmatrix} \begin{bmatrix} R(i,j) \\ G(i,j) \\ B(i,j) \end{bmatrix} \qquad (1)$$

where R, G and B denote the spatial components of the color RGB image respectively.
Now we apply the Retinex algorithm to the image acquired by the night driving system shown in Fig. 2(a), whose luminance component is shown in Fig. 2(b). The Retinex algorithm in equation (5) is used to remove the luminance component of Fig. 2(b), and we obtain the result shown in Fig. 2(c). According to the results in Fig. 2(c), we can notice that the overall brightness component enhanced by the Retinex algorithm is dark and indistinct.
The histogram equalization method can be used to develop the brightness of the image and improve the visual effect, and the result is shown in Fig. 2(d). We can see there is a very bright area in Fig. 2(d) which leads to a significant dividing line, and there are some noise dots in the sky caused by incorrect processing. This is very dangerous for a driver at night because it is not conducive to observing the road information. Therefore, histogram equalization is not suitable for improving the image of night driving systems.
In this paper, we propose a nonlinear transfer function of the S curve:

$$f = 0.5 + \frac{\arctan(a \cdot x - b)}{2t} \qquad (6)$$

where a and b are used to control the curve shape: a stands for the growth speed and t determines the final value. As shown in Fig. 3, it can be found that the S-curve characteristic is very obvious. b determines the position of the growth region of the curve, and a determines the slope of the growth region. The S curve has a relatively fast growth in the beginning and then a weak growth until reaching a certain constant. Its derivative is smooth, so it can fit reality better.
The modified Retinex algorithm using the S curve is given by:

$$P_R(i, j) = 0.5 + \frac{\arctan(a \cdot R(i, j) - b)}{2t} \qquad (7)$$

where $R(i, j)$ is the reflection component image.

B. Experimental results
Then the modified Retinex algorithm with the S curve in equation (7) is carried out on the Y component. The image output is shown in Fig. 4 with $b = 0$, $a = 0.7$, $t = 1.5$. Finally the three components YUV are transformed back to the RGB image as shown in Fig. 5; the transformation formula is given in equation (8) below.
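The following sketch follows equations (1)-(7) with NumPy and SciPy: the Y channel is computed from the RGB image, its illumination is removed in the log domain with a Gaussian surround, and the S curve of equation (7) remaps the result. The value of sigma and the small constant added before taking logarithms are assumptions, as the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# RGB -> YUV coefficients as in eq. (1)
RGB2YUV = np.array([[0.299,   0.587,   0.114],
                    [-0.1471, -0.2888, 0.4359],
                    [0.6148, -0.5148, -0.1000]])

def s_retinex_luminance(rgb, a=0.7, b=0.0, t=1.5, sigma=30.0, eps=1e-6):
    """Enhance the luminance channel of an RGB image in [0,1] with the S-Retinex step."""
    yuv = rgb @ RGB2YUV.T                     # eq. (1)
    y = yuv[..., 0]
    illumination = gaussian_filter(y, sigma)  # Gaussian estimate of I(i,j), eqs. (3)-(4)
    r = np.log(y + eps) - np.log(illumination + eps)   # eq. (5), reflection in log domain
    p_r = 0.5 + np.arctan(a * r - b) / (2.0 * t)        # eq. (7), S-curve remapping
    return np.clip(p_r, 0.0, 1.0)
```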

Figure 2. (a) the source image; (b) the brightness component of the source image; (c) the result by the Retinex algorithm; (d) the result by the histogram equalization method.

Figure 3. The S curve under different conditions: (a) about b, where a = 0.7, t = 1.5 and b = 0, 0.5, 1, 1.5, 2, 2.5, 3; (b) about a, where b = 0, t = 1.5 and a = 0.5, 1, 1.5, 2, 2.5, 3; (c) about t, where b = 0, a = 0.7 and t = 1.5, 2, 2.5, 3, 3.5, 4.

Figure 4. The brightness component output by the S-Retinex method.

Figure 5. (a) the source image; (b) the color image output by the S-Retinex method.

$$\begin{bmatrix} R(i,j) \\ G(i,j) \\ B(i,j) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.140 \\ 1 & -0.394 & -0.581 \\ 1 & 2.032 & 0 \end{bmatrix} \begin{bmatrix} Y(i,j) \\ U(i,j) \\ V(i,j) \end{bmatrix} \qquad (8)$$

According to the result, the image contrast is enhanced and the brightness of the image is improved. The details of the scene are clear, so we can see the grass on the roadside, the building and the door. But we can also see that the shade information is eliminated. Because the image lacks the necessary three-dimensional information, the visual effect is not good. So it is necessary to keep the shade and color information on the basis of the brightness enhancement to make the result image more suitable for observation.
In general, an image is taken by a camera with light, so it carries lots of three-dimensional and shade information. But here we deal with the night-vision image without enough light, so the source image looks very dark. One method to enhance the image is histogram equalization (HE), shown in Fig. 2(d). The traditional HE strategy usually overstates contrast and ignores a lot of detailed information.
We know that for the night-vision image with little light, the light and dark parts of the image need different processing: the light part needs a reduction of its brightness and halo, while the dark part needs to be enhanced according to the distance to the light source to keep the distance information. The main problem is that the traditional HE does not deal with the two parts differently.

III. THE SELECTIVE AND NONLINEAR GRAY MAPPING

Here we need the following four steps.
Firstly, find the light sources in the image. In general, they are street and car lamps. Extract the component above 80 percent of the maximum luminance, erode first to eliminate speckles, then dilate to recover the area; we can thus obtain the point light sources $P_n$, $n = 1 \ldots N$.
Secondly, reduce the halo. Get the center coordinates and the luminance of the point light sources, and compute the luminance-enhancement factor $f_T(i, j)$ related to the distance from the center of the point light source:

$$f_T(i, j) = \min_{n = 1 \ldots N} \exp\left(-\frac{c\left[(i - i_{0n})^2 + (j - j_{0n})^2\right]}{M_n}\right) \qquad (9)$$

where $M_n$ is the area of each point light source, $i_{0n}$ and $j_{0n}$ stand for the X-coordinate and Y-coordinate of the center of each point light source, and $c$ is an undetermined coefficient.
Thirdly, deal with the two parts differently. Compute the luminance-enhancement factor $f_L(i, j)$ related to the luminance:

$$f_L(i, j) = \begin{cases} \dfrac{1}{2}, & \text{in the area of each point light source} \\ d\left(p(i, j) - Light\right) + 1, & \text{other parts in the image} \end{cases} \qquad (10)$$

where $p(i, j)$ is the gray value of $(i, j)$ and $Light$ is 80 percent of the maximum luminance. $d$ is 6 if the whole image is very dark (for example, if the luminance average is less than 0.15), and 3 otherwise.
At last, the luminance component of the whole image is enhanced by

$$p_T(i, j) = p(i, j) \cdot f_L(i, j) \cdot f_T(i, j) \qquad (11)$$

where $f_L(i, j)$ is the luminance-enhancement factor related to the luminance, and $f_T(i, j)$ is the luminance-enhancement factor related to the distance from the center of the point light source.

By the method proposed here, the image shown in Fig. 6(b) can be obtained from the luminance image in Fig. 6(a), with light and three-dimensional information.

Figure 6. (a) the luminance image of the source; (b) the luminance image by the method.

IV. SELECTIVE-RETINEX FUSION ALGORITHM

Comparing Fig. 6(b) with Fig. 4, it can be found that the former has sunlight and shade information but, because of the limited illumination distance of the car lights, it is extremely dark in farther places; the latter eliminates the sunlight and shade influence and is not fit for visual observation because of the lack of depth information.
We get the final luminance image $p_Y(i, j)$ by combining the two enhancement images obtained by (7) and (11) with the weight $g$:

$$p_Y(i, j) = g \cdot p_R(i, j) + (1 - g) \cdot p_T(i, j) \qquad (12)$$

where $g$ is the weight. The value of $g$ decides the fusion image performance greatly. The fusion images obtained with different values of $g$ are shown in Fig. 7. From a large number of experiments, we know that if there are good light conditions and visual observation distance, the best range for $g$ is from 0.1 to 0.3; otherwise it is from 0.3 to 0.6. $p_R(i, j)$ is the gray value given by the luminance adjustment method for the faint image based on the modified Retinex algorithm of the S curve, and $p_T(i, j)$ is the gray value given by the selective and nonlinear gray mapping. Choosing the weighted parameter $g$ properly can retain the essential sunlight and shade information and make the image fit for visual observation, as shown in Fig. 8.
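A small sketch of the fusion step in equation (12) is shown below; choosing g from the ranges reported above is left to the caller, and the function name is illustrative.

```python
import numpy as np

def selective_retinex_fusion(p_r, p_t, g=0.3):
    """Weighted fusion of the S-Retinex result p_r and the gray-mapping result p_t, eq. (12)."""
    return np.clip(g * p_r + (1.0 - g) * p_t, 0.0, 1.0)

# Example: in good light conditions the paper reports g in [0.1, 0.3] works best.
# fused_y = selective_retinex_fusion(p_r, p_t, g=0.2)
```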
We apply the color transfer algorithm of refs. [1,3] with the false color fusion method to the source image in Fig. 9(a). The principle of the color transfer algorithm [1,3] is that it needs a reference image and makes the final image and the reference image have the same mean and standard deviation for each channel in an appropriate color space. Here, the color space for the color transfer is also the YUV space. Comparing the results with Fig. 8(b), we can see that in Fig. 9(c) the whole luminance is inadequate, and the target and the background are not clear enough. In Fig. 9(d), the reflection of the lights is so strong that it is hard to observe.
We apply the method developed here to several images taken from the night-vision driving system. The results are shown in Fig. 10. We can see that the brightness of all the images is improved and the details are enhanced. The output results all show a satisfactory effect of the selective-Retinex fusion method.

V. CONCLUSION

In this paper, we study the night-vision image in the YUV space and find that the enhanced luminance component of the night-vision image is dark and indistinct. The Retinex brightness adjustment algorithm for the night-vision image is proposed to obtain an image with enough brightness by using the S curve, which has a relatively fast growth in the beginning and then a weak growth until reaching a certain constant.

Figure 7. The fusion images with different g: (a) g = 0.1; (b) g = 0.2; (c) g = 0.3; (d) g = 0.4; (e) g = 0.5; (f) g = 0.6; (g) g = 0.7; (h) g = 0.8; (i) g = 0.9.

Figure 8. (a) the luminance component of the image $p_Y(i, j)$; (b) the final image with color.

Figure 9. (a) the indistinct RGB night-vision image; (b) the reference image; (c) the final fused image [3]; (d) the final fused image [1].

In another way, the luminance component Y is enhanced by the two luminance-enhancement factors to obtain the image with the necessary sunlight and shade information. Then, with the weight, the former two images are fused into the final enhanced luminance image, by which the color image is obtained with good visibility.
By the method developed here, the image brightness can be enhanced effectively, and the necessary sunlight and shade information can be preserved very well. It is suitable for drivers to observe the road situation when traveling.

Figure 10. The results of this paper's method: (a), (c): the color night-vision images in kinds of night road conditions; (b), (d): the results by the luminance adjustment method for the night-vision image based on the method developed here.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China under Grant No. 60971119.

REFERENCES

[1] Alexander Toet. Natural color mapping for multiband night vision imagery. Information Fusion, vol. 4, pp. 155-166, August 2003.
[2] Kai-Qi Huang, Qiao Wang, Zhen-Yang Wu. Natural color image enhancement and evaluation algorithm based on human visual system. Computer Vision and Image Understanding, vol. 103, pp. 52-63, 2006.
[3] Songfeng Yin, Liangcai Cao, Yongshun Ling, Guofan Jin. One color contrast enhanced infrared and visible image fusion method. Infrared Physics & Technology, vol. 53, pp. 146-150, 2010.
[4] Guy Gilboa, Nir Sochen, Yehoshua Y. Zeevi. Image enhancement and denoising by complex diffusion processes. IEEE Trans. PAMI, vol. 26, pp. 1020-1036, 2004.
[5] Yen-Ching Chang, Chun-Ming Chang. A simple histogram modification scheme for contrast enhancement. IEEE Trans. Consumer Electronics, vol. 56, pp. 737-742, 2010.
[6] Tarik Arici, Salih Dikbas, Yucel Altunbasak. A histogram modification framework and its application for image contrast enhancement. IEEE Trans. Image Processing, vol. 18, pp. 1921-1935, 2009.
[7] Junxuan Yan, Ke Zhang. Adaptive color restoration and luminance MSR based scheme for image enhancement. 2nd International Conference on Advanced Computer Control (ICACC), pp. 185-188, 2010.
[8] In-Su Jang, Tae-Hyoung Lee, Ho-Gun Ha, Yeong-Ho Ha. Adaptive color enhancement based on multi-scaled Retinex using local contrast of the input image. International Symposium on Optomechatronic Technologies (ISOT), pp. 1-6, 2010.
[9] Nian Liu, Xiuyuan Peng. Multiscale retinex color image recovery enhancement algorithm. Control and Decision Conference, pp. 3976-3979, 2009.

Xue-bo Jin was born in Liaoning in 1972, China. She received the B.E. degree in industrial electrical and automation and the Master's degree in industrial automation from Jilin University, Jilin, China, in 1994 and 1997, and the Ph.D. degree in control theory and control engineering from Zhejiang University, Zhejiang, China, in 2004.
From 1997 to 2011 she was with the College of Informatics and Electronics, Zhejiang Sci-Tech University. Since 2011 she has been with the College of Computer and Information Engineering, Beijing Technology and Business University as a Professor.
Prof. Jin is the author of the articles "Multisensor fusion estimation applied in state monitoring" (Control Theory & Applications, 2009) and "Application of Multisensor State Fusion Estimation in Estimation of Paper Basis Weight" (Systems Engineering - Theory & Practice, 2005). Her research interests include statistical signal processing, video imaging, robust filtering, other stochastic recursive algorithms and their applications in estimation for dynamic systems. In particular, her present major interest is multisensor distributed estimation and decision fusion.

Jia Bao was born in 1978. She received the B.S. degree in applied electronic technology from Anhui Normal University in 2000. In 2003, she received the M.S. degree in power transmission from Southwest Jiaotong University. Currently, she is a lecturer in the College of Informatics and Electronics, Zhejiang Sci-Tech University. Her main research interests include signal processing and power electronics.

Jing-jing Du was born in 1977. She received the B.S. degree in Information Engineering from Harbin Engineering University in 1999. In 2006, she received the M.S. degree in circuit and system from Hangzhou Dianzi University. Currently, she is a lecturer in the College of Informatics and Electronics, Zhejiang Sci-Tech University. Her main research interests include signal processing and image processing.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1195

Research on Detectable and Indicative Forward


Active Network Congestion Control Algorithm
Jingyang Wang
Hebei University of Science and Technology, Shijiazhuang, China
Email: ever211@163.com

Min Huang
Hebei University of Science and Technology, Shijiazhuang, China
Email: huangmin@hebust.edu.cn

Haiyao Wang
Fujian Jiangxia University, Fuzhou, China
Email: why-helen@163.com

Liwei Guo and Wanzhen Zhou


Hebei University of Science and Technology, Shijiazhuang, China
Email: {guoliwei, houwz}@hebust.edu.cn

AbstractIn order to solve the shortages of Forward Active FACC [2],[3] uses active network technology to make
network Congestion Control algorithm (FACC), this paper feedback congestion control more responsive to network
proposes a Detectable and Indicative Forward Active congestion. But FACC still has some problems to be
network Congestion Control algorithm (DIFACC). DIFACC resolved. The mainly shortages of the FACC are as
uses RED buffer queue management algorithm instead of
follows.
standard drop tail algorithm to increase the usage of
bandwidth. It preserves the active detection and passive FACC is a kind of congestion control algorithm
indication mechanism to realize load balance. It introduces based on passive reaction. It merely relieves the
sending speed adjustment policy, and designs different network congestion passively. FACC doesnt take
processing method for different kinds of service data any actively preventive measure on network
according to their different characteristics and requests on congestion according to the current network
network resources. DIFACC can not only relieve congestion condition.
in time when congestion happens, but also can avoid the
congestion and increase the QoS of the different kinds of The transmission priority of indication message is as
service data. A simulation experiment is given to analyze the the same as other packets. In this case, when the
performance of the algorithms. With the analysis of network node is at the congestion state, congestion
performance between FACC and DIFACC, it shows that control will be done immediately to control the
DIFACC not only resolves some shortages in FACC, but congestion. This will lead to the loss of indication
also improves the OoS of different kinds of service data, message. Thus FACC will be invalid.
reduces the package loss rate and decreases the processing
delay of packets. Network node queue management uses standard
drop tail algorithm [2]. When network node queue is
Index TermsDIFACC, congestion control, service data, full, the node will discard all the data packets
active network arriving at later. This will lead to a sharply loss on
data packets coming from the same data stream.
FACC can not ensure the Quality of Service (QoS)
I. INTRODUCTION of different kinds of service data, because it
Active Network is an advanced research area on processes different kinds of service data which has
network in the world in recent years. It is a new network different characteristics and requests on network
technology based on Internet. Active Network was put resources using the same method.
forward by Defense Advanced Research Projects Agency This paper puts forward DIFACC to solve the
(DARPA) in the discussion on development direction of problems in FACC. DIFACC has improved upon FACC.
network system. With the rapid development of Active It introduces the active detection and passive indication
Network research, the research on active network mechanism and combines with RED queue management
congestion control is becoming increasingly important [1]. [5] and the load balance technology. According to the
As a congestion control algorithm in Active Network, transmission characteristics of different kinds of service

2012 ACADEMY PUBLISHER


doi:10.4304/jsw.7.6.1195-1202
1196 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

data, DIFACC introduces the concept of priority and uses node will take some measures to relieve the network
different process methods to ensure the QoS [6]. congestion. In fact, FACC is a kind of congestion control
policy based on passive reaction. Indication message is
II. DIFACC sent at the congestion state. It can only indicate the
congestion condition rather than prevent or predict the
A. RED Buffer Queue Management Algorithm congestion. DIFACC preserves this passive indication
Compared with FACC, RED buffer queue message in FACC and then introduces the active
management algorithm can be used to manage network detection and passive indication mechanism. In other
nodes queue in DIFACC. The congestion condition of words, DIFACC is a kind of preventive congestion
network nodes can be predicted by computing the buffer control policy.
queue length with RED. The RED algorithm can control Active network consists of a group of network nodes
the buffer queue length of the router to a relative lower which are called active nodes. Each active node can be a
value, receive the data packets instantly and prevent data router or a switch. The active nodes comprise the
packets from the burst loss. So that it increases the usage execution environment of the active network. In DIFACC,
of the bandwidth. the congestion information of each active node is given
The RED algorithm uses a moving weighted average in Table.
function to evaluate the buffer queue length and then
TABLE I.
predicts the congestion condition. The RED algorithm in CONGESTION INFORMATION TABLE
DIFACC consists of two parts.
First part is to detect the initial network congestion. Active node IP Update Time Queue Length
When a new data packet arrives, the buffer queue length 192.168.1.2 09:26:54 130
can be calculated by a moving weighted average function 192.168.1.3 09:26:30 50
[5] as follows:
avg (1 wq) * avg wq * q (1) While starting, active nodes will send detection
In this function, wq is the moving weighted average messages to all the neighboring nodes. After receiving the
value; q is the current buffer queue length. RED messages, the neighboring nodes calculate the buffer
compares the buffer queue length with two preplaced queue length using RED algorithm, put the results into
threshold parameters: min_th and max_th, and the header of Active Network Encapsulation Protocol
distinguishes whether the network is in congestion or not. (ANEP) message and send it back to the source nodes, so
The second part is the congestion control rule. If avg that the active nodes can grasp the congestion
min_th and max_th, the network is at the initial information of the neighboring nodes and then create the
congestion state or the congestion avoidance state, and table of congestion information.
RED algorithm discards the data packets randomly with In congestion information table, each record has an
the probability Pa which is approximate in the buffer update time field to describe whether this record is latest
queue length function. or not. Compared the update time with a fixed threshold
avg min_ th Tm, if it is less than Tm, we believe that this record is
Pb * max p (2) latest and effective. If not, we think that this record has
max_ th min_ th
been obsolete.
pb (3)
pa Before they transmit the data, active nodes inquire the
1 n * pb congestion information table whether the records are
In this function, maxp is the largest probability of latest firstly. If not, active nodes send detection message
dropping packets; n is the number of data packets from immediately to the neighboring nodes to update the
the previous time discarding the packets to current time. records, and then find out the node whose buffer queue
If avg is more than max_th, the network will be at length is lowest. It indicates that the network resource of
congestion state and the router will discard each packet this node is most sufficient. And this node is the right one
newly arriving with the probability. that data packets are sent to.
With the description of all above, if avg is more than When the network is busy, the frequent transmissions
or equal to min_th and less than or equal to max_th, the of active detection messages can not avoid the congestion.
network is at the initial congestion state or the congestion On the contrary, it may add the burden of network and
avoidance state, and RED algorithm discards the data lead to more congestion. In order to avoid sending
packets randomly with the probability Pa which is detection message so frequently, active node encapsulates
approximate in the buffer queue length function. If Pa is its own buffer queue length value into the ANEP message
more than or equal to 0 and less than 1, data packets are with the transmission of data packets to all the nodes [6],
not to be discarded. Except this region, the data packets [1] in order to update their tables. Furthermore, when the
are to be discarded. network node is at the congestion state, it will send the
passive indication message to the source node. As same
B. Realizing Load Balance
as the service data packets, active node also encapsulates
In FACC, when the network node is at the congestion the buffer queue length value into the header of the
state, it will send indication message immediately to the ANEP message. The network node which receives these
source node. After receiving the message, the source two kinds of message (active detection message and

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1197

passive indication message) can update the congestion (7) Sending indication packet to update congestion
information table. In brief, to update the congestion information table.
information table has two methods: active detection and (8) Updating congestion information table.
passive indication. However, when active node starts and
C. Setting the Priority of Service Data
the network is quite idle, the active detection message
will be more. When the efficiency of use network According to the different characteristics and demands
resources is high, service data packets and the passive for network resources of the various types of service data
indication message (only at the congestion phase) are which is transmitted in the network, DIFACC algorithm
used to update the table. Due to the congestion designs different priorities. It implements that the active
information tables, active nodes can choose the optimum node uses different priority to process the congestion
route [10] to transmit data, reduce the package loss rate control based on different service data type. DIFACC can
and realize the load balance. The process of realizing load not only relieve congestion in time when congestion
balance in active nodes is shown in Fig.1. DIFACC can happens, but also can ensure that the special service data
prevent or predict the congestion in time. So compared which has the higher priority can achieve the higer QoS.
with FACC, DIFACC is a kind of preventive congestion According to the different data types tranmitted in the
control policy. network, the service data is abstracted by four types,
The steps of realizing load balance are as follows. which are video data, audio data, file data and message
(1) When active node starts, it sends indication packet data. DIFACC implements that the active nodes discard
to neighbor nodes to initialize congestion information the data packets according to different kind of service
table. data and different priority. It also implements that the
(2) Active node is at working states (listening states). active node automatically chooses the congestion control
(3) If active node receives transmission data, it judges algorithm which is suitable for corresponding service to
process congestion control according to the different
services.
In the FACC algorithm, when the congested node
detects the congestion, it will process congestion control
immediately; accordingly the load condition of the
congested node will be relieved. The indication message
packet actually plays a role in promptly responding to the
congestion condition on the whole transmission process.
However the indication message packet may be missed
during transmission, the sending source node can not
make necessary adjustments according to the current
network usage, eventually FACC will be invalid. The
priority of indication message packet is set the highest in
DIFACC. The network nodes directly put the indication
message packet at the head of the buffer queue to treat
them preemptively, and ensure that the indication
message packet is transmitted securely. So the problem of
the algorithm invalidation which is caused by the loss of
the indication message packet during transmission is
solved. The priority of detection message packet is also
set the highest. Indication message packet and detection
message packet are collectively called notification packet.
D. Sending speed adjustment
When the active node is at the congestion state, it will
randomly select and discard the data packet which has the
Figure 1. The process of realizing load balance
lower priority, as well as will send the reduction speed
notification, and it will be need to resend data for the non
whether the congestion information is the latest, if not, real-time data. The sending speed adjustment procedure
includes the reduction sending speed process and the
goes to 7.
increase sending speed process. When the active node
(4) Active node finds out the active node which buffer
queue length is the shortest, then transfers the service congestion condition is relieved, the terminal equipment
data to that node. which has reduced sending speed can increase the
sending speed initiatively.
(5) Active node judges whether the received data is
The mainly steps are shown in Fig.2.
the notification of decreasing speed, if yes, go to 8.
(6) Active node judges whether the transferring ends, if (1) Define variables. The varialbes include the time of
yes, go to 2, otherwise, go to 4. sending data Ts, the time of receiving the reduction speed
notification Tr, the time of receiving the reduction speed
notification last time Tr1, the count of consecutively

2012 ACADEMY PUBLISHER


1198 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

Start

Define variable

Calcute RTT

Send data
record sending time

No Receive reduction speed


notification in a RTT time?
Check congestion
information table Yes

Update congestion
Yes information table
congested

No
No
Nr 0
speed=speed+
increment RTT/2<
Yes No
Current timeTrl
<RTT
No
speed>MAXspeed
Yes Nr=0
Yes Nr++ Trl=current time
Trl=current time
speed =
MAXspeed

speed=speed*(1-Rq*Nr)

No
speed<MINspeed

Yes

speed =
MINspeed

Figure 2. The process of the sending speed adjustment

receiving the reduction speed notification Nr, the average to step 3, otherwise, speed=MAXspeed, continue to send
round-trip time RTT, the reduction ratio Rq. data, go to step 3.
(2) Calculate the average round-trip time RTT. (8) Update the congestion information table.
(3) Send data and record the sending time Ts. (9) Judge whether the count of consecutively receiving
(4) Check whether the terminal node receives the the reduction speed notification is 0. If not 0, go to step
reduction speed notification packet during a RTT time. If 13.
yes, turn into the reduction speed process, go to step 8, (10) Nr increases 1, record the current time as Tr1.
otherwise, turn into increase speed process, and go on. (11) Calcute speed=speed*(1-Rq*Nr).
(5) Check the congestion information table, judge (12) Judge whether speed is less than the speed lower
whether the active node is at the congestion state. If yes, limit, if yes, set the sending speed as the MINspeed,
the sending speed should not be changed, continued continue to send data, go to step 3, otherwise, the sending
sending data, and go to step 3. speed will not be changed, continue to send data, go to
(6) If not at the congestion state, speed=speed+ step 3.
increment. (13) Judge whether RTT/2<The current Time-
(7) Judge whether the speed is greater than the speed Tr1<RTT. If so, go to step 11, otherwise, set Nr as 0, Tr1
upper limit MAXspeed, if not, continue to send data, go is the current time, go to step 3.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1199

Through the above algorithm, the sending node can When network node discards audio packets, it will only
know that the transferring nodes get congested, can notify the source node to decrease the speed of
automatically adjust sending speed which can timely transmitting data and not need to resend the packet.
reduce the burden of active nodes, and can effectively 3) File Data: File data does not need to be
relieve the congestion conditions. The sending node can transmitted on real-time, but it need to be transmitted
increase speed automatically to ensure the data reliably because the file can not run correctly when any
transmission timely after the active nodes congestion is part of it is lost, so we allocate the lower priority to file
relieved. data. When network node discards file data packets, it will
E. Processing Service Data notify the source node to decrease the speed of
According to the different characteristics and request transmitting data and to resend the packet.
of the different kinds of service data on network 4) Message data: Message data does not need to be
resources, different kinds of processing methods are transmitted on real-time and has little relationship among
designed for them. In this paper, active network transmits different packets, but it need to be transmitted reliably, so
four kinds of service data, such as video data, audio data, we allocate the lower priority to message data. When
file data and message data. network node is at the congestion state and discards
1) Video Data: Video data transmission in network is message data packets, it will notify the source node to
usually used MPEG format. MPEG format video data has decrease the speed of transmitting data and to resend the
three kinds of frames, the first one is I frame, the second is packet.
P frame, and the third is B frame. One graphics group F. Customization ANEP Protocol
consists of one I frame, several P frames and several B
According to the service demand, we customize
frames. The important characteristics of those frames are
several option fields of the ANEP protocol which is
shown as follows.
shown in Table II.
I frame is the most important in three kinds of
While the terminal equipment is transmitting data to
frames, because it can be compressed or
decompressed correctly not depending on other the network, firstly it completes the active packet
frames. If I frame is lost, other frames data in the encapsulation according to the ANEP protocol. The
same graphics group such as P frame and B active node receives the active packets and gets active
frame can not be compressed or decompressed packets related information through analysing and
correctly. calculating its related fields, and complements the service
data retransmission. The several important fields settings
P frame is more important than B frame. In same are shown as follows.
graphics group, P frame can be compressed or (1) The identification of service data and the
decompressed after the I fame and P frames congestion control algorithm are allocated automatically.
whose position are more anterior have been The option field of the service type has four kinds of
compressed or decompressed correctly. So the service code: 1 represents real-time voice; 2 represents
former P frame is more important than the one
ordinary message data; 3 represents real-time video; 4
after in same graphics group.
represents file. After identifying the service type, the
B frame is less important than any other frame; active node could automatically allocate the related active
the importance of B frame data in different congestion algorithm according to the service type as
position is same. soon as the network congestion happens.
According to the importance of different kinds of (2) The setting of the service priority. According to the
frames in MPEG format video data, we design the rule of four kinds of service having the different transmission
discarding pocket as follows. When network node is at mechanism and the requirements for QoS, The different
the congestion state, it judges the type of video frame in priority is set to them, which is used for the randomly
packet firstly, if the type of video frame is I frame, selecting and discarding in RED algorithm, and is used
discards the current packet and all the packets after it for ensuring the transmission of special packets. Because
until next I frame packet reaches. If the type of video the transmission of the video and audio service data is
frame is P frame or B frame, we only discard the current real-time and consecutive, once the data packet is lost,
packet. there is little significance for resending, higher priority is
When network node discards video packets, it will set to them. The file service priority is lower. Because the
send indication message immediately to the source node. message service data transmits less data each time, and is
Receiving the message, the source node will only uneasy to generate lots of data packets sudden drop, its
decrease the speed of transmitting data, not need to retransmission and restore is better, the lowest priority is
resend the packet because video data need real-time set to it. To avoid the notification packet going missing
processing. during transmission and making the algorithm ineffective,
the highest priority is allocated for various kinds of
2) Audio Data: Different packets of Audio data are
notification packets. Security transmission is ensured
irrelevant, but it needs to be transmitted on real-time. In
while the congestion happens.
order to ensure the sound can be played fluency and
clarity, we allocate the higher priority to the audio data.

2012 ACADEMY PUBLISHER


1200 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

(3) The mark of the notifying source node. Its total speed and resending message; 2 represents detection
length is 32 bits. The first 16 bits are the length of the message. The forth byte represents whether need
current weighted queue, which is used to record the resending. 0 represents no resending; 1 represents
length of the weighted queue of the current active node resending.
G. Design of code server
TABLE II.
INFORMATION OF CUSTOMIZATION OPTIONAL FIELDS Active code is stored in the form of .class file in the
Option
code library of FTP server. We use the Internet
Defination Option Values Information Services (IIS) of windows to realize FTP
Type
The first 32 bits represent server in which many active network congestion control
mechanism identifier. algorithm is stored. When corresponding active network
1 indicates IPv4 address (32 bits);
1 Source identifier
2 indicates IPv6 address (128 bits);
congestion control algorithm cant be found in active
3 indicates 802.3 address (48 bits); node local, active node sends request to FTP server to
The last 32 bits represent IP address. download the algorithm active code, then loads the
Destination dentifier The first 32 bits represents algorithm dynamically using the loadClass method of
(representing the mechanism identifier.
ClassLoader class.
relayed active node) 1 indicates IPv4 address (32 bits);
2
2 indicates IPv6 address (128 bits);
3 indicates 802.3 address (48 bits); . SIMULATION AND PERFORMANCE ANALYSIS OF
Tthe last 32 bits represent IP ddress. DIFACC
3 Integrated checksum currently unavailable

4
Non-negotiable currently unavailable A. DIFACC Simulation Topology Structure
authentication
Final destination Its total length is 64 bits.The first 32 This paper establishes a simulation experiment system
identifier bits represent mechanism identifier. and analyzes the performance of the two algorithms. The
5
1 indicates IPv4 address (32 bits); DIFACC simulation topology structure is shown in Fig. 3.
2 indicates IPv6 address (128 bits); T1 and T2 are terminal devices, which can be used to
3 indicates 802.3 address (48 bits);
The last 32 bits represent IP address. send / receive data in the simulation experiment system.
Chose of the Its total length is 64 bits. The first 8 R1, R2, R3 and R4 are active nodes, which simulate route
congestion algorithm bits represent the number of the function and transmit data between T1 and T2. In order to
congestion algorithm. The other 56
6
bits represent the name of the class
which realizes the algorithm. (The
max length of the name is 7 letters.)
Service type Its total length is 32 bits.
1 represents real-time voice;
7 2 represents ordinary message data;
3 represents real-time vedio;
4 represents files
Priority Its total length is 32 bits; the values
8
contain 1, 2, 3, 4 and 5. Figure 3. The DIFACC simulation topology structure
Notifying source node Its total length is 32 bits.
The first 16 bits are the length of the compare clearly, FACC and DIFACC algorithms are used
current weighted queue. in the same experimental environment and data.
The third byte represents message
type.
0 represents that it is not the
reduction speed resending message;
9
1 represents the reduction speed
resending message;
2 represents detection message.
The forth byte represents whether
need resending.
0 represents no resending;
1 represents resending. Figure 4. Effect of transferring video in DIFACC
Filename Its total length is 32 bits.
The first 16 bits are the automaticlly
B. Analysis of Service Data Transmission Effect
increasing file index value, which When network transmits video data in a man-made
uniquely determines an object of congestion condition, DIFACC algorithm applies service
10
Class Item, including ID, strDestIP,
strFileName. priority and relativity to realize discard the selected
The last 16 bits are the file sequence packet. Video transmitting effect applied DIFACC
number seqID. algorithm in congestion condition is shown in Fig.4. It
shows that the whole effect can be maintained and the
while the packet arriving. It prevents and avoids the image is clarity relatively, however, real-time of image
congestion occurrence through the active detection and becomes worse and produces dithering. In the same
passive indication mechanism. The third byte represents environment, Fig.5 illustrates the transmitting effect
message type. 0 represents that it is not the reduction applied FACC algorithm, it shows that the image is
speed and resending message; 1 represents the reduction illegibility and the quality is very poor.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1201

In the same way, when network transmits audio data in


a man-made congestion condition, it can be ensured to be
transmitted reliably, because DIFACC algorithm
allocates higher priority to audio data. When network
discards audio packet in congestion condition, the sound
is also clarity and continuous despite the real-time
becomes poor. However, when network applies for
FACC algorithm, the effect becomes worse clearly, the
sound is not continuous as before and many audio data is
discarded in transmission. Figure 8. Packets loss rate in DIFACC

Figure 5. Effect of transferring video in FACC


The active congestion control algorithm applies for
decreasing speed and resending mechanism to process Figure 9. Packets loss rate in FACC
file and message data. When network is at congestion As shown in Fig.8 and Fig.9, packets loss rate curve is
states, the congestion can be relieved by decreasing speed stable and raises slowly using
Figure 9. FACC algorithm. In the
of transmitting data. However FACC applies for drop tail same condition, packets loss rate curve raises faster using
algorithm which does not send message which is used for DIFACC algorithm.
decreasing speed and resending, so it can not relieve
congestion independently and lead to massive loss of
service data, thereby it can not ensure service data
transmit reliably.
B. Performance Analysis

Figure 10. Congestion queue in FACC

Figure 6. Buffer queue length in FACC

Figure 11. Congestion queue in DIFACC

As shown in Fig.10 and Fig.11, the buffer queue length


rises quickly to its maximum using FACC algorithm;
network is congested in short time; data packets are
Figure 7. Buffer queue length in DIFACC massively discarded. In the same condition, using
As shown in Fig.6 and Fig.7, axis of ordinate in Fig.6 DIFACC algorithm, data packets arent massively
is higher than Fig.7 obviously. It means FACC buffer discarded and the buffer queue curve rises slowly in
queue is longer than that of DIFACC. Thus, the buffer respect to FACC. The curve of DIFACC doesnt reach
queue is often longer in FACC. There is a longer waiting the maximum of the buffer queue length but trends to
queue in the nodes transmission. In DIFACC, the buffer stable at the minimum threshold value.
queue keeps short mostly. Therefore, there is a shorter DIFACC introduces the active detection and passive
waiting queue in the nodes transmission so that the indication preventive mechanism. Through this
processing delay of packets in network is decreased. mechanism, active nodes can know the congestion
information in time, actively choose the optimum route to
transmit data, predict the congestion information

2012 ACADEMY PUBLISHER


1202 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

effectively, reduce the packets loss rate, realize load [8] Xu Yongbo, and Wang Xingren, The Research and
balance and increase the efficient use of network Implementation of Time Management Services in
Distributed Simulation System, Proceedings of Asian
resources. Conference on System Simulation and ScientificSimulation
Conference, Shanghai, 2002.
. CONCLUSIONS [9] Appel A W, Foundation Proof-Carrying Code, IEEE
Communication Magnize, 2001.
DIFACC applies different processing methods to [10] MAXEMCHUKNF, and LOWSH, Active Routing,
different kinds of service data (such as video data, audio IEEE Journal on Selected Areas in Communications, 2001,
data, file data and message data) because of the different pp. 552-565.
characteristics and requests on network resources, adopts [11] S.Murphy, Security Architecture for ActiveNets, AN
Security Working group, July 15, 1998.
the cooperation of the load balanced management
[12] A. B. Kulkarni and S. F. Bush, Active network
algorithm and RED queue management algorithm, and management and Kolmogorov complexity, IEEE
introduces the active detection and the passive indication OPENARCH 2001, Anchorage, AK, Apr. 2001.
mechanism. It succeeds in avoiding the algorithm
invalidation caused by the indication message loss,
promotes the efficient usage of network resources. With
the analysis of performance between FACC and DIFACC
in simulation experiment system, it is known that
DIFACC not only resolves the problems in FACC, but
Jingyang Wang, Associate Professor, born in 1971. He
also reduces the packets loss rate, decreases the
received the B.Eng. degree in computer software from Lanzhou
processing delay of packets and promotes the efficient University, China, in 1995. He received the M.Sc. degree in
usage of network resources. software engineering from Beijing University of Technology,
However, when active node starts just now or the China, in 2007. His main research areas include active network,
network is idle, many active detection messages will be transmition control, and distributed computing.
produced, the network resource will be wasted seriously.
At the same time, there are also some limitations in
processing video data, because it only aims at MPEG
format. Min Huang, born in 1979. He received the B.Eng. degree in
automatic control from Hebei University of Science and
Technology, China, in 2000. He received the M.Sc. degree in
ACKNOWLEDGMENT computer science from Beijing Institute of Technology, China,
The authors wish to thank the editor and referees for in 2003. His main research interests include network and
their careful review and valuable critical comments. This communication, system modeling and identification, image
processing and distributed computing.
work is supported by the Science Fund of Hebei of China
No. 11213522D.

REFERENCES Haiyao Wang, born in 1976. She received the B.Eng. degree
in machinery design and manufacture from HeFei University of
[1] Jingyang Wang, Xiaohong Wang et al, The Research of
Technology, China, in 1998. She received the M.Sc. degree in
Active Network Congestion Control Algorithm,
industrial engineering from HeFei University of Technology,
Proceedings of the WiCom2007, September 2007.
China, in 2009. Her main research interests include algorithm
[2] Wang Bin, Liu Zeng-Ji et al, Forward active networks
congestion control algorithm and its performance design and industrial control.
analysis, Acta Electronica Sinica, Vol 29, No. 4, April.
2001, pp. 483-486.
[3] Carlo Tarantola, Dynamic Active Networks Services,
Proceedings of the 2004 IEEE International Conference on Liwei Guo, Professor, born in 1956. He received the M.Sc.
Mobile Data Management, pp. 46-47, June 2004. degree in automatic control from Harbin University of Science
[4] Xu Jiali, and Liu Suqin, Discussing and Implementing and Technology, China, in 1988. His main research interests
Method of Active Network Architecture, Control & include network and communication, system modeling and
Automation, Vol 11, No.2, pp. 232-245, Nov. 2004. automatic control.
[5] LA Grieco and S. Mascolo, TCP Westwood and Easy
RED to Improve Fairness in High-Speed Networks,
Seventh International Workshop on Protocols For High-
Speed Networks (PfHSN'2002), Berlin, Germany, pp. 130
146, April 2002. Wanzhen Zhou, Professor, born in 1966. He received the
B.Eng. degree in applied mathematics from Harbin University
[6] K.Psounis, Active Networks; Applications, Security,
Safety and Architectures, IEEE Communications Surveys, of Technology, China, in 1988. He received the M.Sc. degree in
Vol 2, No. 1, 1999. computer science from Harbin University of Technology, China,
[7] Zhang Ke-ping, and Tian Liao, and Li Zeng-zhi, A New in 1992. His main research interests include network and
Queue Management Algorithm with Priority and Self- database, system modeling and image processing.
Adaptation, Acta Electronica Sinica, Vol 6, No. 4, pp.
324-328, July. 2004.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1203

Makespan Minimization on Parallel Batch


Processing Machines with Release Times and Job
Sizes
Shuguang Li
College of Computer Science and Technology, Shandong Institute of Business and Technology, Yantai, China
Email: sgliytu@hotmail.com

AbstractThis paper investigates the scheduling problem of times of the optimal value. Furthermore, a family of
minimizing makespan on parallel batch processing
approximation algorithms is called a polynomial-time
machines encountered in different manufacturing
environments, such as the burn-in operation in the approximation scheme (PTAS) if, for any fixed 0 ,
manufacture of semiconductors and the aging test operation at least one of the algorithms has a worst-case ratio no
in the manufacture of thin film transistor-liquid crystal more than 1 .
displays (TFT-LCDs). Each job is characterized by a The scheduling problem considered in this paper is
processing time, a release time and a job size. Each machine
can process multiple jobs simultaneously in a batch, as long described as follows: There is a set J {1,2, , n} of
as the total size of all jobs in the batch does not exceed n jobs that can be processed on m batch processing
machine capacity. The processing time of a batch is machines. Each job, j , is characterized by a triple of real
represented by the longest time among the jobs in the batch.
An approximation algorithm with worst-case ratio 2 is numbers ( r j , p j , s j ) , where r j is the release time
presented, where 0 can be made arbitrarily small. before which job j cannot be scheduled, p j is the
Index Termsscheduling, parallel batch processing processing time which specifies the minimum time
machines, makespan, release times, job sizes, worst-case needed to process job j without interruption on any one
of the machines, and s j (0,1] is the size of job
analysis
j.
Each batch processing machine has a capacity 1 and can
I. INTRODUCTION process a number of jobs simultaneously as a batch as
long as the total size of jobs in the batch does not exceed
Batch processing machines are encountered in many 1. The available time and processing time of the batch are
different environments, such as the diffusion and burn-in represented by the latest release time and longest
operations in semiconductor fabrication, heat treatment processing time among the jobs in the batch, respectively.
operations in metalworking, and aging test operations in Jobs processed in the same batch have the same
the manufacture of thin film transistor-liquid crystal completion time (the completion time of the batch in
displays (TFT-LCDs). In these operations, the machines which they are contained), i.e., their common start time
are usually treated as batch-processing machines that can (the start time of the batch in which they are contained)
accommodate several jobs as a batch for processing plus the processing time of the batch. Once the process
simultaneously, with the total size of the batch not begins, it cannot be interrupted until the process is
exceeding machine capacity. Since different batching completed. Our goal is to find a schedule for the jobs so
groups require different available times and processing that the makespan, defined as the completion time of the
times, the batching and scheduling of the jobs is highly last job, is minimized. This model is expressed as
non-trivial and can greatly affect the production rate.
Many batch scheduling problems are NP-hard, i.e., for P | r j , s j , b 1 | C max .
many of them there does not exist any polynomial time Recently, many research efforts have been devoted to
algorithm unless P = NP. Researchers therefore turn to scheduling problems concerned with batch processing
studying approximation algorithms for these kinds of machines. These problems have either identical or non-
problems. The quality of an approximation algorithm is identical job size characteristics.
often measured by its worst-case ratio: the smaller the With regard to batch-processing machine scheduling
ratio is, the better the algorithm will be. We say that an problems with identical job size characteristics, Chandru,
algorithm has a worst-case ratio (or is a - Lee, and Uzsoy [1] proposed a branch-and-bound method
approximation algorithm) if for any input instance, it to minimize total completion time on a single batch-
always returns in polynomial time of the input size a processing machine and presented several heuristics for
feasible solution with an objective value not greater than identical parallel batch-processing machines as well. Lee,

2012 ACADEMY PUBLISHER


doi:10.4304/jsw.7.6.1203-1210
1204 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

Uzsoy, and Martin-Vega [2] studied the single batch- researchers have applied metaheuristics to solve batch
processing machine problem and provided dynamic processing machine problems. Melouk et al. [21]
programming-based algorithms to minimize the number provided a simulated annealing approach to minimize
of tardy jobs and the maximum tardiness under a number makespan for scheduling a batch processing machine
of assumptions. They also provided two heuristic with different job sizes. An effective hybrid genetic
algorithms for the problem of parallel batch-processing algorithm is developed by Husseinzadeh Kashan et al.
machines with makespan criterion. Sung and Choung [3] [22], using a representation that could dominate a
proposed a branch-and-bound method to minimize the random-key based genetic algorithm and also the
makespan for a single batch-processing machine problem. simulated annealing approach by Melouk et al. [21].
Lee and Uzsoy [4] presented a number of efficient Kohetal. [23], proposed some heuristics and a random
heuristics to solve the single batch-processing machine key based representation genetic algorithm for the
problem with unequal release times. In addition, Li et al. problems of minimizing makespan and total weighted
[5] extended the study of the single batch-processing completion time on a batch processing machine within
machine problem by Lee and Uzsoy [4] to involve an compatible job families. A hybrid genetic algorithm is
examination of the identical parallel batch processing proposed by Chou et al. [24], to minimize makespan for
machines problem and proposed a polynomial time the dynamic case of the single batch processing machine
approximation scheme (PTAS). They also obtained the problem. Chou [25] developed a joint approach for
first PTAS for the problem of minimizing maximum scheduling in the presence of job ready times, based on
lateness on identical parallel batch processing machines the genetic algorithm in which the dynamic programming
[6]. Studies in identical job sizes were also done by algorithm is used to evaluate the fitness of the generated
Dupont and Ghazvini [7], Qi and Tu [8] and Wang and solutions. Parsa et al. [26] presented a branch and bound
Uzsoy [9]. algorithm to minimize makespan on a single batch
With regard to batch-processing machine scheduling processing machine with non-identical job sizes. The
problems with non-identical job size characteristics, scheduling problem with bi-criteria of makespan and
Uzsoy [10] derived complexity results for makespan and maximum tardiness by considering arbitrary size for jobs
total completion time criteria and provided some is also addressed by Husseinzadeh Kashan et al. [27].
heuristics and a branch and bound algorithm for the case Some researchers have also focused on scheduling with
of a single batch processing machine. Zhang et al. [11] non-identical job sizes on identical parallel batch
examined the worst-case performance of the heuristics processing machines (Koh et al. [28], Chang et al. [29]
addressed by Uzsoy [10] for the single machine and Husseinzadeh Kashan et al. [30]).
makespan problem. They also proposed an improved To the best of our knowledge, there has been no
algorithm with a 3/2 worst-case ratio. Li et al. [12] constant-ratio approximation algorithm for the general
presented a ( 2 ) -approximation algorithm for the P | r j , s j , b 1 | C max problem to date. In this paper we
single machine problem with release times, where combine the techniques of [5, 6, 12] to solve this problem
0 can be made arbitrarily small. Nong et al. [13] and present an approximation algorithm with worst-case
studied the problem of scheduling family jobs on a batch ratio 2 , where 0 can be made arbitrarily small.
processing machine to minimize the makespan and We use BPP (Batch Processing Problem) to denote the
presented an approximation algorithm with a 5/2 worst-
general problem P | r j , s j , b 1 | C max and use SBPP
case ratio. Dupont and Dhaenens-Flipo [14], on the other
hand, presented some dominance properties and proposed to denote the problem which is the same as BPP except
a branch-and-bound method to solve the single batch- that all jobs can be split in size. The outline of our main
processing machine scheduling problem with non- idea is as follows: we first get a PTAS for SBPP in
identical job sizes. Chung, Tai, and Pearn [15] considered Section 2, and then use it to get a ( 2 ) -
the parallel batch-processing machines with unequal approximation algorithm for BPP in Section 3.
release times and non-identical job sizes, which is
motivated by the aging test operation in the manufacture II. A PTAS FOR PROBLEM SBPP
of TFT-LCD. For this problem, they proposed a mixed
integer programming model and three heuristic In this section, we present a polynomial time
algorithms to minimize makespan. Wang et al. [16] approximation scheme for problem SBPP. We use opt
proposed the mixed integer programming model, genetic to denote the optimal makespan of problem SBPP.
algorithm and simulated annealing algorithm to solve the Throughout this section, if a job has been split in size and
scheduling problem of parallel batch-processing some part of it has been scheduled, the remaining part of
machines with unequal release times, non-identical job it will be treated as a single job.
sizes, and different machine capacities. Studies which The special case of P | r j , s j , b 1 | C max where all
discussed the total completion time objective were done
by Chang and Wang [17] and Ghazvini and Dupont [18]. 1
Recently, metaheuristics such as simulated annealing r j 0 and all s j is already strongly NP-hard [2],
B
(SA), tabu search (TS), and genetic algorithm (GA) have
been successfully employed in solving difficult where B ( 1 B n ) is an integer. Lee et al. [2]
combinatorial optimization problems. A number of observed that there exists an optimal schedule for this

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1205

special case in which all jobs are pre-assigned into into batches according to the MBLPT. Hence we get
batches according to the BLPT (full-batch-longest- d
processing-time) rule: rank the jobs in non-increasing opt .
order of processing times, and then batch the jobs by m
We use the MBLPT rule for all the jobs and get a
successively placing the B (or as many as possible) jobs
with the longest processing times into the same batch. number of batches. Starting from time rmax we schedule
To solve the general P | r j , s j , b 1 | C max problem, these batches by List Scheduling algorithm [20]:
whenever a machine is idle, choose any available batch to
we need the following modified version of the FBLPT
rule. start processing on that machine. Suppose that batch A
is the last batch to finish in the List Scheduling schedule.
MFBLPT Rule It must be the case that from time rmax on, no machine is
Index the jobs in non-increasing order of their idle prior to the start of batch A , otherwise we would
processing times. Place the job with the longest
have scheduled A earlier. So A must start no later than
processing time in a batch. If the batch has enough room
d
for the next job in the job list, then put the job in the rmax . Then A must finish no later than
batch; otherwise, place part of the job in the batch such m
that the batch is completely full and put the remaining d
part of the job at the head of the remaining job list and rmax p max . Hence we get
continue. m
d
A job is called a split job if it is split in size. We call a opt rmax p max , which completes the proof of
job available if it has been released but not yet assigned
m
the lemma.
into a batch. We call an available job suitable for a given
d
batch if it can be added in that batch. We call a batch
Let max{ rmax , p max , }. Round each
available if all the jobs in it have been released and it has m
not been scheduled. release time down to the nearest multiple of . After
We will perform several transformations on the given
getting a schedule for the rounded problem, we can
input to form an approximate problem instance that has a
simpler structure. Each transformation potentially increase each batchs start time by in the output to
increases the objective function value by O( ) opt , so
obtain a feasible schedule for the original problem. As
opt , we get the following lemma.
we can perform a constant number of them while still
staying within a 1 O ( ) factor of the original optimum. Lemma 2. With 1 loss, we can assume that all the
When we describe such a transformation, similar to [19], release times in an instance are multiple of , and the
we shall say that it produces 1 O ( ) loss. To simplify number of distinct release times is at most 1 / 1.
One can see that all the jobs in J can be scheduled in
notations we will assume throughout the paper that 1 /
d
is integral. the time interval [0, rmax p max ] . We partition
In the remainder of this section, we first simplify the m
problem by applying the rounding method. We proceed to
d
define short and long jobs and then present a PTAS for this time interval into h (rmax p max ) /
the case where all jobs are short. Finally, we get a PTAS m
for problem SBPP.
disjoint intervals in the form [ Ri , Ri 1 ) , where
A. Simplifying the Input
Ri (i 1) for each 1 i h and
We use the FBLPT rule for all the jobs and get a series
d
of batches. Denote by d the total processing time of Rh 1 rmax p max . Since
these batches. Let rmax max 1 j n r j . Then we get the m
d
following bounds for the optimal makespan of problem max{ rmax , p max , } , we have h 3 / 1 .
SBPP: m
Lemma 1. Note that each of the first h 1 intervals has a length ,
d
max{ rmax , p max , } opt rmax p max .
d and the last one has a length at most . By Lemma 2, we
m m can assume that every job in J is released at some Ri
Proof. It is obvious that opt max{rmax , pmax }. By a ( 1 i 1 / 1 ).
job-interchange argument, we observe that for the special We say that a job (or a batch) is short if its processing
case of problem SBPP in which all r j 0 , there exists time is smaller than ; and long, otherwise.
an optimal schedule in which all jobs are pre-assigned

2012 ACADEMY PUBLISHER


1206 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

We can assume that there are only a constant number We modify all the batches Bi , j in as follows:
of distinct processing times of long jobs, as the following
lemma states. reduce the processing time of each job in Bi , j to
Lemma 3. With 1 3 loss, the number of distinct q( Bi , j ) , 1 j k i , 1 i 1 / 1 . We call the
processing times of long jobs, k , can be bounded from
obtained batches modified batches. Each original job is
above by 1 / 1 / 1.
3
now modified into one new job if it has not been split, or
Proof. By Lemma 1 and the definition of long jobs, we two new jobs if it has been split. (Any original short job
know that for each long job j , p j (1 / ) . will be split at most once.) We call the new jobs modified
jobs. Then we define two accessory problems:
We round each long jobs processing time down to the
SBPP1: To schedule the modified batches to minimize
nearest integral multiple of . This creates a rounded
2
makespan.
instance in which there are at most SBPP2: To schedule the modified jobs to minimize
[(1/ ) ] /( 2 ) ( ) /( 2 ) 1 1/ 3 1/ 1 makespan.
Both these problems deal with the modified jobs. But
distinct processing times of long jobs. Hence we get
while SBPP1 demands to leave the grouping of the
k 1 / 3 1 / 1. Consider the optimal value of the modified jobs into batches as dictated by the MFBLPT
rounded instance. Clearly, this value cannot be greater rule, SBPP2 allows the re-opening of the batches and
than opt , the optimal makespan of problem SBPP. As playing with the grouping into batches. Hence, SBPP2
there are at most 3 / long batches in any optimal
2 might obtain a better makespan. However, we are going
to prove that this is not the case by showing
schedule in the rounded instance, by replacing the
rounded values with the original ones we may increase that opt1 opt 2 opt , where opt1 and opt 2
the solution value by at most denote the optimum values to SBPP1 and SBPP2,
(3 / 2 )( 2 ) 3 3 opt . respectively.
Any optimal solution to SBPP1 is a feasible solution to
B. Short Jobs SBPP2, therefore we get opt1 opt 2 . On the other
In this subsection we concentrate on the case in which hand, any optimal solution to SBPP2 can be transformed
all the jobs are short. Based on the ideas of [5, 12], we into a feasible solution to SBPP1 without increasing the
present a very simple and easy to analyze approximation objective value, which implies that opt1 opt 2 . To
scheme for this case.
show this, let us fix an optimal solution, 2* , to SBPP2.
Denote by J i the subset of jobs in J that are
Suppose that A is the batch which starts earliest among
released at Ri ( 1 i 1 / 1).
the batches in 2* with the longest processing time.

Algorithm ScheduleShort Suppose that A is the batch which becomes available


Step 1. Use the MFBLPT rule for all the jobs in earliest among the modified batches with the longest
processing time. We exchange the modified jobs which
J1 , J 2 ,, J1 / 1 , respectively. are in A but not in A and the modified jobs which are
Step 2. Use the List Scheduling algorithm [20] to in A but not in A without increasing the completion
schedule the obtained batches.
time of any batch in 2* . Consequently, A appears in
modified 2 . Repeat this procedure until all the modified
Theorem 1. If all the jobs are short, then Algorithm *
ScheduleShort is a PTAS for problem SBPP.
batches except those with processing time zero appear in
Proof. Let be the schedule produced by Algorithm
modified 2 . The modified jobs with processing time
*
ScheduleShort. Suppose that Bi ,1 , Bi , 2 , , Bi ,ki are the
zero are fully negligible and thus can be batched in such a
whose jobs are from J i ( 1 i 1 / 1 )
batches in way that the modified batches with processing time zero
appear in modified 2 . We eventually achieve a feasible
*
such that Bi ,1 , Bi , 2 , , Bi ,ki 1 are full batches and
solution to SBPP1, whose makespan is not greater than
q( Bi , j ) p( Bi , j 1 ) , where p( Bi , j ) denotes the
that of 2* . It follows that opt1 opt 2 . Therefore we
processing time of the longest job in Bi , j , and q( Bi , j )
get opt1 opt 2 . It is obvious that opt 2 opt . Hence
denotes the processing time of the shortest job in Bi , j if we get opt1 opt 2 opt .
Bi , j is full and is set to zero otherwise. Then we have the Consider a schedule, denoted by , which is obtained
following observation: by using the List Scheduling algorithm for all the
ki
modified batches. Then we have
( p( B
j 1
i, j ) q( Bi , j )) p( Bi ,1 ) . (1)
C max ( ) C max ( )
1 / 1 ki

( p( B i, j ) q( Bi , j )).
i 1 j 1

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1207

Inequality (1) implies that the second term on the right-hand side of the above inequality is bounded from above by (1/ε + 1)·ε²·opt = (ε + ε²)·opt. Hence we get:

C_max(π) ≤ C_max(π') + (ε + ε²)·opt.  (2)

On the other hand, we claim that C_max(π') ≤ (1 + 2ε²)·opt. Suppose that A is the last batch to finish in π'. Consider the latest idle time point t_1 prior to the start of batch A. It is easy to see that t_1 must be a release time, i.e., one of the ends of the first 1/ε intervals. Since all batches in π' are short, any batch that starts before t_1 must finish earlier than t_1 + ε²·opt. By the rule of the List Scheduling algorithm, any batch which starts after t_1 cannot be released earlier than t_1, otherwise it would be scheduled earlier. From t_1 onwards, no machine is idle prior to the start of batch A. It follows that C_max(π') ≤ opt_1 + 2ε²·opt ≤ (1 + 2ε²)·opt. Thus the claim holds. The claim, together with inequality (2), implies that C_max(π) ≤ (1 + ε + 3ε²)·opt, completing the proof of the theorem.

C. General Case

We are now going to establish a PTAS to solve the general SBPP problem.

By the job interchange argument, we get the following lemma, which plays an important role in the design and analysis of our algorithm.

Lemma 4. There exists an optimal schedule with the following properties:
(1) on any one machine, the batches started (but not necessarily finished) in the same interval are processed successively in the order of non-increasing batch processing times;
(2) from time 0 onwards, interval by interval, the batches started in the same interval are filled in the order of non-increasing batch processing times such that each batch contains as many as possible of the longest suitable jobs; and
(3) any job can be split in size whenever necessary, therefore all the batches in the same interval are full batches except possibly the shortest one.

The following lemma is useful:

Lemma 5. With 1 + 3ε + ε² loss, we can assume that no short job is included in long batches.

Proof. By Lemma 4, there exists an optimal schedule in which only the last long batch in each interval may contain short jobs. Therefore, we can stretch those intervals to make extra spaces with length ε²·opt for the short jobs that are included in the long batches. Since there are 3/ε + 1 intervals, we may increase the solution value by at most (3/ε + 1)·ε²·opt, which is no more than (3ε + ε²)·opt. This completes the proof of the lemma.

Combining Lemma 5 and Theorem 1, we can determine the batch structure of the short jobs at the beginning of the algorithm as follows: use the MFBLPT rule for all the short jobs in J_i (1 ≤ i ≤ 1/ε + 1) and get a series of short batches.

The idea for dealing with long jobs is essentially based on enumeration. Recall that the number of distinct processing times of long jobs, k, has been bounded from above by 1/ε³ - 1/ε + 1 (Lemma 3). Without loss of generality, let P_1, P_2, ..., P_k be the k distinct processing times of long jobs. Suppose further that P_1 > P_2 > ... > P_k. We now turn to the concepts of machine configurations and execution profiles.

Let us fix a schedule, σ. We delete from σ all the jobs and the short batches, but retain all the empty long batches, which are represented, respectively, by their processing times. For a particular machine, we define a machine configuration, with respect to σ, as a vector (c_1, c_2, ..., c_{3/ε+1}), where c_i consists of all the empty long batches started on that machine in interval [R_i, R_{i+1}), 1 ≤ i ≤ 3/ε + 1. For the sake of clarity, we define c_i equivalently as a k-tuple (x_{i1}, x_{i2}, ..., x_{ik}), where x_{ij} is the number of empty long batches started in interval [R_i, R_{i+1}) on the machine with P_j as their processing time, 1 ≤ i ≤ 3/ε + 1, 1 ≤ j ≤ k.

The processing time of a long batch is chosen from the k ≤ 1/ε³ - 1/ε + 1 values. When c_i contains l empty long batches (i.e., Σ_{j=1}^{k} x_{ij} = l), the number of different possibilities is not greater than k^l. Since a feasible schedule has the property that on any one machine at most 1/ε long batches are started in each of the intervals, the number of machine configurations to consider, γ, can be roughly bounded from above by (1 + k + k² + ... + k^{1/ε})^{3/ε+1} ≤ 2^{3/ε+1} · k^{(3/ε+1)/ε}.

This allows us to say that, for a given schedule, a particular machine has a certain configuration. We denote the configurations as 1, 2, ..., γ. Then, for any schedule, we define an execution profile as a tuple (m_1, m_2, ..., m_γ), where m_i is the number of machines with configuration i for that schedule. Therefore, there are at most (m + 1)^γ execution profiles to consider, a polynomial in m.

We next present our algorithm.

Algorithm ScheduleSplit
Step 1. Get all possible execution profiles.
Step 2. For each of them, do the following:
(a) Assign a configuration to each machine according to the profile. If this is not possible, delete the profile.
(b) On each machine in each interval, start the specified empty long batches as early as possible in the order of non-increasing processing times. If some batch has to be delayed to start in one of the next intervals, then delete the profile.
(c) From time 0 onwards, interval by interval, fill the empty long batches started in the same interval in the order of non-increasing batch processing times such that each of them contains as many as possible of the longest suitable jobs (any job can be split in size whenever necessary). If some long job cannot be assigned into a batch and has to be left, then delete the profile.
(d) Run Algorithm ScheduleShort in the spaces left by the long batches and get a feasible schedule. If a short batch crosses an interval, we stretch the end of the interval to make an extra space with length ε²·opt for it such that it need no longer cross the interval.
Step 3. From among the obtained feasible schedules, select the one with the smallest makespan.

Theorem 2. Algorithm ScheduleSplit is a PTAS for the general SBPP problem.

Proof. By Lemma 4, the long batches started in the same interval on the same machine can be arranged in the order of non-increasing batch processing times. Note that we can stretch the end of an interval to make an extra space with length ε²·opt for a crossing short batch such that it need no longer cross the interval. Therefore, given an execution profile, we can first start the empty long batches as early as possible while keeping them in the specified intervals, and then run Algorithm ScheduleShort in the spaces between them.

Any optimal schedule is associated with one of the (m + 1)^γ execution profiles. Given an execution profile that can lead to an optimal schedule, our way of dealing with the long jobs in Algorithm ScheduleSplit is optimal, while invoking Algorithm ScheduleShort yields at most 1 + ε + 3ε² loss. Combining Lemmas 2, 3 and 5, by taking the smallest one among all obtained feasible schedules, Algorithm ScheduleSplit can be executed with at most 1 + 8ε + 4ε² loss.

It is easy to see that the time complexity of Algorithm ScheduleSplit is O(n log n + n·(m + 1)^{γ+1}).

III. AN ALGORITHM FOR PROBLEM BPP

Now we start to construct an approximation algorithm for BPP. We say that a batch splits a job if it contains some part but not the last part of the job, and the batch is then called a splitting batch.

Algorithm ScheduleWhole
Step 1: Get a (1 + ε/2)-approximation schedule σ_1 for SBPP by Algorithm ScheduleSplit.
Step 2: Move out all split jobs from σ_1 and open a new batch for each of them.
Step 3: Process the new batches successively at the end of σ_1', on the same machines as the corresponding splitting batches in σ_1, where σ_1' is the schedule that is obtained from σ_1 after removing from it all split jobs.

Theorem 3. Algorithm ScheduleWhole is a (2 + ε)-approximation algorithm for problem BPP, where ε > 0 can be made arbitrarily small.

Proof. Denote by π the schedule given by Algorithm ScheduleWhole. Let C_max and C*_max be the makespans of π and of an optimal schedule for BPP, respectively. Recall that opt denotes the optimal makespan of problem SBPP. It is obvious that opt ≤ C*_max and that π is a feasible schedule for BPP. Note that π consists of two parts: one is σ_1' and the other consists of the new batches opened for the split jobs. The completion time C_1 of the former part is no more than (1 + ε/2)·opt. Let us consider the maximum total processing time C_2 on any machine of the latter part. From Algorithm ScheduleSplit, each batch splits at most one job and each job can be split at most once in σ_1. Since the processing time of a split job cannot be greater than that of the corresponding splitting batch, it follows that C_2 ≤ C_1. Thus we get

C_max ≤ C_1 + C_2 ≤ 2(1 + ε/2)·opt = (2 + ε)·opt ≤ (2 + ε)·C*_max.

This completes the proof of the theorem.

Note that in the algorithm the treatment of the split jobs is very simple (each one goes into its own batch and all the new batches are processed at the end of σ_1'). Is it possible to improve this and get a better worst-case ratio? In [12], the authors showed an example to explain why more involved techniques for batching the split jobs do not seem to yield a better worst-case ratio. One might expect that a more educated choice of the new batches' start times would improve the ratio, for example, starting each new batch immediately after the completion of the corresponding splitting batch. However, this is not the case, because the generic bad cases remain the same.

In Algorithm ScheduleWhole, Step 1 can be executed in O(n log n + n·(m + 1)^{γ+1}) time, while Steps 2 and 3 can be executed in O(n) time; therefore this algorithm can be implemented in O(n log n + n·(m + 1)^{γ+1}) time.
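The List Scheduling rule [20] used as a subroutine by Algorithm ScheduleShort, and therefore inside Algorithms ScheduleSplit and ScheduleWhole, is Graham's greedy rule: take the batches in the given list order and always start the next one on the machine that becomes free first, never before the batch's release time. The following is a minimal sketch in Python, written purely as an illustration; the (release_time, processing_time) pair format for a batch is our own assumption, not notation from the paper.

import heapq

def list_schedule(batches, m):
    """Graham's List Scheduling of batches on m identical machines.

    batches: iterable of (release_time, processing_time) pairs, taken in
    the given list order.  Returns the makespan and the chosen start times.
    """
    free_at = [0.0] * m              # time at which each machine becomes free
    heapq.heapify(free_at)
    starts, makespan = [], 0.0
    for release, length in batches:
        earliest = heapq.heappop(free_at)   # machine that is free first
        start = max(earliest, release)      # never start before the release time
        heapq.heappush(free_at, start + length)
        starts.append(start)
        makespan = max(makespan, start + length)
    return makespan, starts

The fact exploited in the proof of Theorem 1, that the latest idle point before the last batch must be a release time, follows directly from this rule.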
ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (60970105), the National Natural Science Foundation of China for Distinguished Young Scholars (11161035), the Special Fund of Shandong Provincial Information Industry Department (No. 2008X00039), and the Shandong Provincial Soft Science Research Program (2011RKGB5040).

REFERENCES

[1] Chandru, V., Lee, C.-Y., & Uzsoy, R. (1993). Minimizing total completion time on batch processing machines. International Journal of Production Research, 31, 2097-2121.
[2] Lee, C.-Y., Uzsoy, R., & Martin-Vega, L. A. (1992). Efficient algorithms for scheduling semiconductor burn-in operations. Operations Research, 40, 764-775.
[3] Sung, C. S., & Choung, Y. I. (2000). Minimizing makespan on a single burn-in oven in semiconductor manufacturing. European Journal of Operational Research, 120, 559-574.
[4] Lee, C.-Y., & Uzsoy, R. (1999). Minimizing makespan on a single batch processing machine with dynamic job arrivals. International Journal of Production Research, 37, 219-236.
[5] Shuguang Li, Guojun Li, Shaoqiang Zhang (2005). Minimizing makespan with release times on identical parallel batching machines. Discrete Applied Mathematics, 148, 127-134.
[6] Shuguang Li, Guojun Li, Shaoqiang Zhang. Minimizing maximum lateness on identical parallel batch processing machines. Lecture Notes in Computer Science 3106: Proceedings of the 10th Annual International Conference on Computing and Combinatorics, 229-237, 2004.
[7] Dupont, L., & Ghazvini, F. J. (1997). A branch and bound algorithm for minimizing mean flow time on a single batch processing machine. International Journal of Industrial Engineering, 4, 197-203.
[8] Qi, X., & Tu, F. (1999). Earliness and tardiness scheduling problems on a batch processor. Discrete Applied Mathematics, 98, 131-145.
[9] Wang, C.-S., & Uzsoy, R. (2002). A genetic algorithm to minimize maximum lateness on a batch processing machine. Computers & Operations Research, 29, 1621-1640.
[10] Uzsoy, R. (1994). Scheduling a single batch processing machine with non-identical job sizes. International Journal of Production Research, 32, 1615-1635.
[11] Zhang, G., Cai, X., Lee, C.-Y., & Wong, C. K. (2001). Minimizing makespan on a single batch processing machine with nonidentical job sizes. Naval Research Logistics, 48, 226-240.
[12] Shuguang Li, Guojun Li, Xiaoli Wang, Qiming Liu. Minimizing makespan on a single batching machine with release times and non-identical job sizes. Operations Research Letters, 33(2): 157-164, 2005.
[13] Q. Q. Nong, C. T. Ng and T. C. E. Cheng (2008). The bounded single-machine parallel-batching scheduling problem with family jobs and release dates to minimize makespan. Operations Research Letters, 36(1), 61-66.
[14] Dupont, L., & Dhaenens-Flipo, C. (2002). Minimizing the makespan on a batch machine with non-identical job sizes: An exact procedure. Computers & Operations Research, 29, 807-819.
[15] Chung, S. H., Tai, Y. T., & Pearn, W. L. (2008). Minimising makespan on parallel batch processing machines with non-identical ready time and arbitrary job sizes. International Journal of Production Research. doi:10.1080/00207540802010807.
[16] Wang Hui-Mei and Fuh-Der Chou (2010). Solving the parallel batch-processing machines with different job sizes and capacity limits by metaheuristics. Expert Systems with Applications, 37, 1510-1521.
[17] Chang, P.-C., & Wang, H.-M. (2004). A heuristic for a batch processing machine scheduled to minimize total completion time with non-identical job sizes. International Journal of Advanced Manufacturing Technology, 24, 615-620.
[18] Ghazvini, F. J., & Dupont, L. (1998). Minimizing mean flow times criteria on a single batch processing machine with non-identical jobs sizes. International Journal of Production Economics, 55, 273-280.
[19] F. Afrati, E. Bampis, C. Chekuri, D. Karger, C. Kenyon, S. Khanna, I. Milis, M. Queyranne, M. Skutella, C. Stein, M. Sviridenko (1999). Approximation schemes for minimizing average weighted completion time with release dates. Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, New York, October, 32-43.
[20] R. L. Graham (1966). Bounds for certain multiprocessing anomalies. Bell System Technical Journal, 45: 1563-1581.
[21] Melouk S, Damodaran P, Chang P-Y. Minimizing makespan for single machine batch processing with non-identical job sizes using simulated annealing. International Journal of Production Economics, 2004, 87: 141-7.
[22] Husseinzadeh Kashan A, Karimi B, Jolai F. Effective hybrid genetic algorithm for minimizing makespan on a single batch processing machine with non-identical job sizes. International Journal of Production Research, 2006, 44: 2337-60.
[23] Koh S-G, Koo P-H, Kim D-C, Hur W-S. Scheduling a single batch processing machine with arbitrary job sizes and incompatible job families. International Journal of Production Economics, 2005, 98: 81-96.
[24] Chou FD, Chang PC, Wang HM. A hybrid genetic algorithm to minimize makespan for the single batch machine dynamic scheduling problem. International Journal of Advanced Manufacturing Technology, 2006, 31: 350-9.
[25] Chou FD. A joint GA+DP approach for single burn-in oven scheduling problems with makespan criterion. International Journal of Advanced Manufacturing Technology, 2007, 35: 587-95.
[26] N. Rafiee Parsa, B. Karimi, A. Husseinzedeh Kashan. A branch and bound algorithm to minimize makespan on a single batch processing machine with non-identical job sizes. Computers & Operations Research, 2010, 37(10): 1720-1730.
[27] Husseinzadeh Kashan A, Karimi B, Jolai F. Bi-criteria scheduling on a single batch processing machine with non-identical job sizes. In: Proceedings of the 12th IFAC Symposium on Information Control Problems in Manufacturing, INCOM 2006, St-Etienne, France, 2006.
[28] Koh S-G, Koo P-H, Ha J-W, Lee W-S. Scheduling parallel batch processing machines with arbitrary job sizes and incompatible job families. International Journal of Production Research, 2004, 42: 4091-4107.
[29] Chang P Y, Damodaran P, Melouk S. Minimizing makespan on parallel batch processing machines. International Journal of Production Research, 2004, 42: 4211-20.
[30] Husseinzadeh Kashan A, Karimi B, Jenabi M. A hybrid genetic heuristic for scheduling parallel batch processing machines with arbitrary job sizes. Computers & Operations Research, 2008, 35: 1084-98.
Shuguang Li was born in Shandong, China, in 1970. He received the PhD degree in operations research and cybernetics from Shandong University, Jinan, China, in 2007. He is an associate professor in the College of Computer Science and Technology, Shandong Institute of Business and Technology, Yantai, China. His current research areas are combinatorial optimization and theoretical computer science.
Ontology and CBR-based Dynamic Enterprise


Knowledge Repository Construction
Huiying Gao
School of Management and Economics, Beijing Institute of Technology, Beijing, China, 100081
Email: huiying@bit.edu.cn

Xiuxiu Chen
School of Management and Economics, Beijing Institute of Technology, Beijing, China, 100081
Email: xiuxiuchen77@163.com

Abstract- The efficiency of knowledge sharing and learning is the key for knowledge-intensive industries to obtain sustainable development. However, current applications of enterprise knowledge repositories can hardly adapt to personalized retrieval with semantic expansion and cannot support a dynamic mechanism of knowledge sharing. This paper focuses on an integrated framework and the operating processes of dynamic knowledge repository construction. Through analyzing the key technology points of the business logic processing layer and the data services layer in particular, an ontology and CBR-based knowledge storage and retrieval mechanism is studied, which improves the effectiveness of knowledge management.

Index Terms- ontology; case-based reasoning; knowledge repository; semantic retrieval

I. INTRODUCTION

With the development of the knowledge economy and the overheating competition of the market, many knowledge-intensive industries such as aviation, advanced manufacturing, IT and consulting are suffering from the distress of increasing knowledge asset outflow, where the span of technical requirements is wide and the management task is difficult and complicated. A recent research paper in the Journal of Knowledge Management shows that knowledge application is directly related to organizational performance [1]. A good environment of knowledge sharing and learning will improve an enterprise's efficiency and service and finally bring it sustainable competitiveness.

However, the most precious knowledge of an enterprise often exists in the minds of its employees, in work processes, in experiences, and in electronic or written form. A research paper published in Information Systems Research in 2010 pointed out that knowledge capabilities with IT contribute to firm innovation [2], and also demonstrated that individuals will absorb and recreate knowledge in an organization with strong knowledge storage and retrieval capabilities [3]. Therefore, knowledge repository construction plays a vital role in the knowledge transformation and sharing between individuals and the enterprise.

Massive domain knowledge resources and experiences have been accumulated in many knowledge-intensive enterprises in recent years. However, there are still many urgent problems that they have to face. There is a lack of a general knowledge model for domain support, and there is a lack of semantic support for knowledge case representation, retrieval and reuse. The organization of the knowledge repository is fixed in a single form and the hierarchical structure is ambiguous. Besides, a knowledge repository with weak case learning ability cannot meet the dynamic mechanism of knowledge reuse.

This paper aims at dynamic enterprise knowledge repository construction based on ontology and case-based reasoning (CBR). First, in section II ontology and CBR are briefly introduced and the state of the art is summarized. After that, in section III the framework of the ontology and CBR-based dynamic enterprise knowledge repository is proposed and the key technology points of the business logic processing layer and the data services layer are described respectively. Taking an information system consulting company as an example, section IV shows the construction process of the enterprise domain ontology, and section V designs the dynamic mechanism of the case repository, including case representation, organization, retrieval and learning. Finally, section VI presents our conclusion and outlook.

II. ONTOLOGY AND CASE-BASED REASONING

A. Ontology
Ontology is the formal, explicit specification of a shared conceptual model; it captures the basic domain terms and their relationships, defines the relevant rules to determine the vocabulary extension, and finally forms a knowledge structure model of a specific area in order to achieve a consistent understanding of the domain knowledge [4]. As ontology provides a clear semantic and knowledge description of concepts and their interrelations, it can be adapted to case description and hierarchical structure storage and can support semantic knowledge retrieval.

This work was supported in part by the National Natural Science Foundation of China under Grant 71102111 and Beijing Institute of Technology under Grant 3210050320908. Corresponding author: Huiying Gao, Email: huiying@bit.edu.cn.
B. Case-Based Reasoning
CBR is an important knowledge-based problem solving and learning method in the artificial intelligence field. It has good extensibility and the ability to learn [5]. Each processed problem is described as a feature set with its solutions and then stored as a case in the system. When a new problem comes, the most similar case is retrieved and modified if necessary. The modified case is treated as a new case and stored, in order to realize the reuse and relearning of cases. Case retrieval is therefore the key point of case reasoning.

C. Related Research
Dynamic knowledge repository management is a hot issue in the current information field, and many scholars have done a great deal of research based on ontology.
In the Journal of Information Science, many related studies have been published, such as knowledge extraction [6], uniform knowledge representation [7], and knowledge matching and retrieval [8] with ontology technology. Moreover, in the Journal of Knowledge Management, information processing based on ontology in the construction process of a knowledge management system is explored [9]. Besides, Liao Liangcai imported ontology into a knowledge management system and realized enterprise knowledge management through semantic expansion, reasoning and retrieval [10]. All these studies show the important role of ontology in the realization of knowledge sharing and reuse.
Meanwhile, unstructured knowledge such as the experiences and ideas of employees is more suitable to store in the form of cases, and it is easier to realize dynamic knowledge management based on CBR. So the following research focuses on a dynamic enterprise knowledge repository model with an effective combination of ontology and CBR technology. With the establishment of the domain ontology, semantic consistency for knowledge representation and storage is ensured and semantic expansion of the user's query demand is realized. Besides, case modification, learning, reuse and new case formation enhance the adaptability of the knowledge repository and realize its dynamic construction.

III. ONTOLOGY AND CBR-BASED DYNAMIC ENTERPRISE KNOWLEDGE REPOSITORY FRAMEWORK

A. Integrated Framework
An ontology and CBR-based dynamic enterprise knowledge repository model is discussed in this part. The basic framework is illustrated in Fig. 1, which adopts a four-tier system structure: customer application layer, business logic processing layer, data services layer and system layer. All these layers work closely together to complete the work of knowledge management. The main functions of each layer are described as follows.

Figure 1. The framework of the dynamic knowledge repository: customer application layer (system administrator, knowledge engineers, normal users); business logic processing layer (system management, knowledge acquisition, case definition, case retrieval processing, case reasoning matching, case modify, case learning and ontology management modules); data services layer (user information database, ontology database, case database, multimedia library); system layer (OS, DBMS, WWW server, communication protocols).

The customer application layer provides a good interaction interface for the users, including knowledge engineers, normal users and the system administrator. The business logic processing layer encapsulates the core functional modules of the knowledge repository system, which are responsible for knowledge acquisition, representation, case definition and storage, ontology analysis, as well as case retrieval, learning and so on. The data services layer is the basic part of this system, which logically realizes the expression and organization of the user information database, multimedia library, ontology database and case database. The system layer offers the operating system, database management system, server, data standards, network, communication protocols, and many other physical supports.
As the three types of users have different functional requirements, their operational processes are analyzed and illustrated respectively as follows.
Generally speaking, knowledge engineers first acquire explicit and tacit knowledge from related experts, the enterprise's original non-structural databases and many other channels with the knowledge acquisition module. Secondly, the core domain concepts are extracted and the enterprise ontology database is constructed and maintained by the ontology management module. According to the hierarchical structure of the ontology, knowledge engineers annotate the semantic information and definitions for the cases and build the case classification index. Finally, the metadata of the defined cases is stored in the case base, while non-structured data and the original documents are stored in the multimedia library in XML format.
Normal users follow a different process. Firstly, several keywords are input through the case retrieval processing module, and then semantic annotation is added based on the ontology, the user's profile and the retrieval history. The user's query can then be represented by a semantic vector, which is later matched with the source cases. Finally, cases beyond a certain threshold are sent to the client application layer. Sometimes failure cannot be avoided: when the user's needs fail to be met, some cases are combined together and refined according to the ontology by the case modify module. In this way, a new case is formed and stored in the case database through the case learning module, while other additional information is added to the multimedia library.
It can be seen that the ontology management module provides semantic support throughout the whole process of dynamic management.
The system administrator manages the permissions of the different users, with reference to the users' records in the user information database, through the system management module.
As the business logic processing layer and the data services layer play an important role in the dynamic knowledge repository system, they are discussed further in part B and part C.

B. Business Logic Processing Layer
The business logic processing layer of the dynamic knowledge repository system put forward in this paper includes the following main function modules.
1) System Management Module: It is responsible for controlling the access authorities of the different types of users. For example, the administrator manages users' access; knowledge engineers deal with knowledge management and maintenance, while normal users retrieve knowledge.
2) Knowledge Acquisition Module: Two ways can be adopted to acquire domain knowledge. One approach is to arouse the domain experts' initiative to obtain knowledge through brainstorming or the Delphi method, which is simple, direct and efficient, but strong dependence on the experts is the main problem. The other way is to dig for knowledge from the existing resources, including the enterprise's non-structural databases, internal materials, patent documents, intranet, extranet, BBS and other internal communication, which is the preparation for the ontology construction.
3) Ontology Management Module: Based on the analysis of the domain scope and characteristics, knowledge engineers extract the core concepts as well as their respective sub-categories from the enterprise. Also, the attributes of each sub-category and the constraints on the attributes are given. Finally, the architecture of the ontology is set up with various interrelationships, such as the project classification system, the enterprise business department classification system and so on. All these domain ontologies form the ontology model and are stored in the ontology database. The details are illustrated in section IV.
4) Case Definition Module: With reference to the ontology model, some key features are extracted from the documents, the existing project cases and long-term accumulated experiences, which are used to express the corresponding cases in a particular way. Additionally, the ontology-based case classification index mechanism is established in order to organize the cases clearly.
5) Case Retrieval Processing Module: The function of this module is to segment and analyze the user's query and then extend it to the different extracted words. The words in user queries can directly map to the concepts, attributes or instances of the ontology, with which every dimension of the query vector can be replaced in the semantic view.
6) Case Reasoning Matching Module: The function of this module is to achieve the semantic matching between the user's query and the annotated cases. Much of the work is to expand synonymous concepts and to clarify ambiguous query information or intention. Based on the results of the query pre-processing, a query vector with semantic expansion is generated for the semantic matching with the representation vectors of the source cases in the case database. After computing the similarity between them with a certain retrieval strategy and algorithm, the cases beyond a certain threshold are returned in order.
7) Case Modify Module: As the scale of the knowledge repository is so limited that it cannot satisfy all users' demands, this module provides the functions of case combination and adjustment according to the defined ontology in order to form new cases.
8) Case Learning Module: The function of this module is to learn the modified cases automatically according to certain rules and to enrich the case database gradually. Specific information is described in section V.

C. Data Services Layer
According to the category of enterprise knowledge, the data are stored in the user information database, ontology database, case database and multimedia library.
The user information database is designed to store all the users' personal background information, retrieval history and reuse records. The prescriptive documents and relational database reflect users' needs and preferences in order to improve the pragmatic retrieval.
The ontology database maintains the domain concepts, their properties, the attribute constraints and the relations between the concepts, and finally forms a concept model with a clear structure. Ontology is so important that many other parts of the knowledge repository system are established based on it; it contributes to domain knowledge reasoning, case retrieval, matching and learning for dynamic knowledge repository management.
The case database stores the cases accumulated by the enterprise over the long term. It aims at offering reference for subsequent problems, which is vital to realize knowledge reuse effectively.
The multimedia library stores semi-structured and even non-structured data such as the corresponding documents of project cases, related project designs, flow charts, technologies and methods, source code, photos, video conferences and so on.

IV. CONSTRUCTION OF THE ENTERPRISE ONTOLOGY
It is obvious that the enterprise ontology is the basis of the whole knowledge system and determines the performance as well as the quality of its operation. Therefore, how to establish the enterprise ontology correctly, effectively and logically is very important.
Based on the dynamic knowledge repository framework above, we adopt the framework method [4] given by Uschold & Gruninger to build the enterprise ontology with the ontology modeling tool Protege 3.3.4.
Taking an information system consulting company as an example, Fig. 2 and Fig. 3 illustrate the hierarchy and a fragment of the ontology respectively. Firstly, according to the outline, we identify the scope of the enterprise domain ontology in many ways with knowledge engineers and other experts, such as research, interviews, brainstorming and so on. Secondly, the core domain concepts are extracted after analysis and evaluation, and the interrelations and hierarchical structure are defined as well. The main relationships of the ontology may be <is A>, <a part of>, <equal to>, <similar to>, <instance of> and <attribute of>. Fig. 2 shows the synonymous relationship <equal to> and hyponymy such as <subclass of>. Finally, the attributes of each sub-category and the constraints on the attributes are given, and the ontology is described with OWL.

Figure 2. Ontology hierarchy of project types.

Figure 3. Ontology fragment of project types.

A small section of the OWL source code is shown in the following; it is generated automatically by the ontology modeling tool Protege 3.3.4 after the ontology is set up. As we can see, "Video conference system" is equal to "Video session system", and "Multi-media project" is a subclass of "Project".

<owl:Class rdf:ID="Video_conference_system">
  <owl:equivalentClass rdf:resource="#Video_session_system"/>
  <rdfs:subClassOf rdf:resource="#Multi-media_project"/>
</owl:Class>
<owl:Class rdf:ID="Multi-media_project">
  <rdfs:subClassOf rdf:resource="#Project"/>
</owl:Class>

Furthermore, object properties are defined as follows:

<owl:ObjectProperty rdf:ID="has_name"/>
<owl:ObjectProperty rdf:ID="has_budget"/>
<owl:ObjectProperty rdf:ID="has_constructors"/>
<owl:ObjectProperty rdf:ID="has_goals"/>
<owl:ObjectProperty rdf:ID="has_owner"/>
<owl:ObjectProperty rdf:ID="has_requirement"/>

As the ontology exerts a strong influence on the utilization of the knowledge repository, an evaluation method based on user feedback is adopted in the domain ontology assessment. That is to say, knowledge engineers analyze the richness of the ontology information (including concepts, attributes and the definition of instances) and its semantic intensity (including the ontology structure and relationships) on the basis of users' satisfaction.
Apart from the evaluation, domain ontology maintenance is also necessary. It mainly refers to a series of adjustments: corrective, perfective and adaptive maintenance work. Corrective maintenance focuses on putting errors right during use. Perfective maintenance refers to ontology expansion work as knowledge increases. Adaptive maintenance points to the refreshment of the structure, attributes and relationships with changes of the environment. With continuous tracking and management of the ontology model, it provides better support for the dynamic knowledge repository.

V. ONTOLOGY-BASED CASE DATABASE CONSTRUCTION

A. Case Representation
Case representation is a kind of knowledge expression which encodes knowledge into a data structure for the computer. In this paper the case database is described as {case_1, case_2, ..., case_i, ..., case_n}, and case_i is illustrated by the ordered tuple <case ID, case title, initial problem description, solution description, additional information>; we write case_i = <ID, TI, IN, SO, AD> for short. The unique identifier of a case is expressed by ID. SO refers to the case solution and its specification, including performance, causes, the main problems, and the economic and social benefits, which is a significant basis for case reuse. AD is related to the corresponding documents and multimedia content; document elements and other unstructured data are more suitable to store in a multimedia format, which is convenient for the user to understand the whole case.
We divide the content used for case retrieval into the case title part, represented by TI, and the initial problem description part, represented by IN, which are the basis of the case retrieval.
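The OWL fragment above records synonymy through owl:equivalentClass and hyponymy through rdfs:subClassOf, which are precisely the relations the retrieval process in section V.C exploits for query expansion. The following is a small Python sketch, not part of the authors' C# prototype, of how such an exported Protege file could be queried with the rdflib library; the file name and base URI are hypothetical.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDFS

def expansion_terms(owl_file, concept_uri):
    """Collect synonym and hyponym/hypernym concepts of one class."""
    g = Graph()
    g.parse(owl_file, format="xml")     # RDF/XML as produced by Protege

    concept = URIRef(concept_uri)
    synonyms = set()
    # owl:equivalentClass may be asserted in either direction
    synonyms.update(g.objects(concept, OWL.equivalentClass))
    synonyms.update(g.subjects(OWL.equivalentClass, concept))

    hypernyms = set(g.transitive_objects(concept, RDFS.subClassOf))
    hyponyms = set(g.transitive_subjects(RDFS.subClassOf, concept))
    hypernyms.discard(concept)
    hyponyms.discard(concept)
    return synonyms, hypernyms, hyponyms

# Hypothetical usage for the fragment above:
# syn, hyper, hypo = expansion_terms(
#     "project_ontology.owl",
#     "http://example.org/project#Video_conference_system")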

With the consideration of the ontology database, a feature vector is extracted from these two parts through analysis and expressed by

C_i = {(t_{i1}, w_1), (t_{i2}, w_2), ..., (t_{im}, w_m)},  i = 1, 2, ..., n,

where t_{ij} represents the j-th feature of the i-th case and w_j is the corresponding weight of the j-th feature (1 ≤ j ≤ m).

B. Case Organization
In order to improve the speed of case retrieval, the system organizes the cases in a hierarchical structure according to the case classification ontology. Then an index mechanism is built on the different fields, such as the project classification or industry index, as shown in Fig. 4. So, when the user inputs some search features, the appropriate index is adopted to search for the optimal cases efficiently with semantic expansion.

Figure 4. The case organization based on project types (Project: E-government project, E-business project, Multi-media project, Network construction project, Enterprise management, ...; with sub-types such as Golden Finance Project, Golden Tax Project, Government Portals, Video conference system, Video surveillance system, Distance education, OA, ERP).

C. Case Retrieval
The case retrieval is conducted with semantic expansion based on the enterprise ontology. The reason for the semantic query expansion with ontology is that in the query language there are several situations, as follows.
Firstly, there are many synonyms. This is quite common in natural language; for example, "E-commerce" and "E-business" both mean commerce conducted electronically, and the relationship between these words is called synonymy in the ontology. Besides, users sometimes prefer abbreviating well-known words; for example, "online-offline" is often replaced with "OO" for short. So, when a user enters some keywords, they can be extended by the ontology to their synonyms.
Secondly, there are many concepts with ambiguities and pragmatic environmental differences. In many cases the phenomenon of polysemy appears. Take the word "project" as an example: it may mean "engineering" in the broad perspective, or "scientific research" in the scientific field, while sometimes it means "program" or "planning". Besides, when a user inputs "finance" as his application industry demand, it may be considered as the government industry or the financial industry. If a record of the user's personal information is "e-government consultant" and many of his historical records are related to government programs rather than financial industry projects, we will prefer to interpret the application industry as government. Thus, in order to eliminate the ambiguity, we should first consider the user's background information to clarify the specific and tacit implications.
Thirdly, there are words with superordinate and subordinate concepts. In many cases, only through this relationship can we retrieve the potential information, such as project, multimedia and video conference.
According to the case representation method illustrated in part A of this section, suppose there are m keywords; then one semantic vector can be expressed as T_i = (t_{i1}, t_{i2}, ..., t_{im}), i = 1, 2, ..., n.
Based on semantic expansion with the domain ontology, each of the keywords can be expressed as a semantic vector, which is used to match with the keywords defined in the source cases. As the importance of the case title part and the initial problem description part differs in concrete applications, a keyword appearing in the title part and in the initial problem description part should be weighted differently. Let α represent the weight of the title part and β the weight of the initial problem description part, with α + β = 1. Generally speaking, the main content is shown more clearly in the title of a case, so we define α > β in the case retrieval process of this paper.
Suppose C is a candidate case set, C_i ∈ C is one candidate case of this set, and C* represents the query vector that is consistent with the user's demand. We define the feature vector of a case as T_i = (t_{i1}, t_{i2}, ..., t_{im}). Then the semantic similarity between C_i and C* can be defined as (1):

SIM(C_i, C*) = α·SIM_TI(C_i, C*) + β·SIM_IN(C_i, C*),  (α > β, α + β = 1).  (1)

In the formula above, SIM_TI(C_i, C*) is the similarity of the title part and SIM_IN(C_i, C*) is the similarity of the initial problem description part.
While computing the value of SIM_TI(C_i, C*), the frequency of each keyword in the vector space of each case title is ignored. Therefore, if the j-th keyword does not appear in the title of case C_i, in other words, if the j-th keyword is not important in the title of case C_i, then t_{ij} = 0; otherwise the j-th keyword is very important in the title of case C_i and t_{ij} = 1.
However, while computing the value of SIM_IN(C_i, C*), the frequency of each keyword in the vector space of the initial problem description part of each case should be considered, as the descriptions of cases are more complicated and reflect more information. So the frequency of the j-th keyword is counted. In this part, let t_{ij} (i = 1, 2, ..., n; j = 1, 2, ..., m) denote the frequency of the j-th keyword in the description of the i-th case, where m is the total number of terms. In the same m-dimensional vector space the description of the i-th case can then be represented by T_i = (t_{i1}, t_{i2}, ..., t_{im}), and the value of t_{ij} denotes the importance of the j-th keyword.
As t_{ij} may be greater than 1, the next step is to regress the feature vector of the case description to [0, 1] for convenient calculation. Let

d_{ij} = ( t_{ij} - min_{1≤j≤m} t_{ij} ) / ( max_{1≤j≤m} t_{ij} - min_{1≤j≤m} t_{ij} ).  (2)
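Equation (2) is the usual min-max regression. As a concrete check, the description frequency vector of the best-matching case in the validation example of section V.E, T_IN = (1, 4, 10, 1), becomes (0, 1/3, 1, 0). A few illustrative Python lines (ours, not the prototype's code):

def normalize(t):
    """Eq. (2): min-max regression of a frequency vector to [0, 1]."""
    lo, hi = min(t), max(t)
    if hi == lo:                 # degenerate case, not discussed in the paper
        return [0.0] * len(t)
    return [(x - lo) / (hi - lo) for x in t]

print(normalize([1, 4, 10, 1]))  # [0.0, 0.333..., 1.0, 0.0]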

In this way, not only the title part but also the case description part can be represented by the expression D_i = (d_{i1}, d_{i2}, ..., d_{im}). In particular, in the title part, where t_{ij} takes the value 0 or 1, D_i = (d_{i1}, d_{i2}, ..., d_{im}) is the same as T_i = (t_{i1}, t_{i2}, ..., t_{im}).
Set the user's query vector as the standard; then the query vector is C* = W·I = (w_1, w_2, ..., w_j, ..., w_m), where I represents a unit matrix and W is set with a scoring method. Only the user can understand his demand most clearly; therefore, w_j is calculated from the keyword preferences of the user. We divide the user's keyword preference level into the following fuzzy set: {very important, important, common, unimportant, very unimportant}. For quantitative analysis, the fuzzy set can be mapped into the vector {5, 4, 3, 2, 1}. If the user's preference vector for the m keywords is (x_1, x_2, ..., x_m), with 1 ≤ x_j ≤ 5 and integer, then

w_j = x_j / Σ_{j=1}^{m} x_j.  (3)

Therefore, the similarity between the query information and the source case information in the two parts is calculated with (4) and (5), respectively:

SIM_TI(C_i, C*) = Σ_{j=1}^{m} w_j · cos(T_i, C*),  (4)

SIM_IN(C_i, C*) = Σ_{j=1}^{m} w_j · cos(D_i, C*) = Σ_{j=1}^{m} w_j · (D_i · C*) / (‖D_i‖·‖C*‖) = Σ_{j=1}^{m} w_j · ( Σ_{j=1}^{m} d_{ij}·c*_j ) / ( √(Σ_{j=1}^{m} d_{ij}²) · √(Σ_{j=1}^{m} c*_j²) ).  (5)

While computing the similarity between the query vector C* and the semantic vectors C_1, ..., C_n, the largest similarity is denoted by SIM_max(C_i, C*). If the value is larger than or equal to a given threshold λ, the case is added to the search results. Finally, the search results are sorted according to the similarity and sent to the user.

D. Case Reuse and Learning
Generally speaking, the scale of the initial case base is so limited that it cannot satisfy all the different needs of customers, so a dynamic management mechanism becomes necessary. It mainly includes the realization of case reuse, modification and case learning on the basis of case retrieval. Fig. 5 shows the flow chart of this dynamic mechanism.

Figure 5. Flow chart of dynamic management (input query vector, calculate similarity, decompose the query and combine cases when no match is found, adjust, evaluate, learn and store the case, or merge cases).

In the process of case reuse, two main problems should be considered: one is the difference between the new problem and the retrieval results; the other is which part of the result can be reused. We define the cases meeting the user's demand as the matching set G, that is, G = {C_i | SIM(C_i, C*) ≥ λ}.
If G = ∅, we decompose the query vector by level in accordance with the ontology and calculate the similarity in order to search for a matching set that is not empty. This level is marked as the starting point, and the cases are combined from the bottom up with the ontology rules, which coordinate the constraints and standards between different cases in the process of combination. Two methods of combination are appropriate: exhaustion and a genetic algorithm. When the number of case combination choices is small, all possible combinations can be listed and the feasible solutions found; we consider the one with the largest similarity to the new case as the optimal solution. Otherwise, a genetic algorithm is more efficient. Then the process of case adjustment begins.
If G ≠ ∅, the case combination is skipped and the process goes directly into case adjustment. As case adjustment also needs to consider other information besides the case similarity, the case should be adjusted and decided according to the case description, solutions and other affiliated information with human help, in a manual or semi-automatic way. We define the case after adjustment as the target case, which is closer to the new problem than the source case.
The case is then evaluated in practice, and a new case which meets our requirements is learned. In order to avoid redundant information in the knowledge repository, the similarity between the new case and the source case is also calculated to determine whether the new case should be stored or processed further. Given a certain threshold, let S represent the source case and N the new case; if SIM(S, N) is below the threshold, the new case is stored; otherwise, the two cases are merged.
In this process, the ontology provides semantic support for case retrieval, matching, combination, adjustment and case learning.

E. Validation and Analysis
In order to verify the retrieval effectiveness of the research work in this paper, a simple prototype system for the consulting industry was developed based on an SQL database and the C# language.
It is difficult to make the case database rich at the very beginning. Therefore, we mainly collected relevant data from the enterprise's original relational database system, intranet and project materials, and then sorted them into 40 information system project consulting cases after analysis.
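Before turning to the prototype's numbers, the scoring rule of section V.C can be summarized in one place. The sketch below is our own Python illustration of Eqs. (1)-(5), not the authors' C# implementation, and the function names are assumptions: preference scores give the weights, the title is compared as a 0/1 vector, the description frequencies are min-max normalized before the cosine, and the two parts are mixed with α and β.

import math

def weights(scores):
    """Eq. (3): keyword weights from the 1-5 preference scores."""
    total = sum(scores)
    return [x / total for x in scores]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def normalize(t):
    """Eq. (2): min-max regression of the description frequencies."""
    lo, hi = min(t), max(t)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in t]

def similarity(title01, description, w, alpha=0.6, beta=0.4):
    """Eqs. (1), (4), (5): weighted two-part similarity to the query C* = W."""
    c_star = w                                   # the query vector is the weight vector
    sim_ti = sum(w) * cosine(title01, c_star)    # Eq. (4); sum(w) = 1 by Eq. (3)
    sim_in = sum(w) * cosine(normalize(description), c_star)   # Eq. (5)
    return alpha * sim_ti + beta * sim_in        # Eq. (1)

def retrieve(cases, w, threshold=0.5):
    """Keep the cases whose similarity reaches the threshold, best first."""
    ranked = [(similarity(ti, de, w), cid) for cid, ti, de in cases]
    return sorted([p for p in ranked if p[0] >= threshold], reverse=True)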

These cases were deposited into the knowledge repository in the description and organization way illustrated above, which can be seen as the preparation for the operation.
Suppose that a consultant who is mainly responsible for e-government projects needs to retrieve information about historical projects which provided project management services for a video conferencing system in the finance field; moreover, if the Ministry of Finance is the main project stakeholder, that will be better. So the key retrieval features are extracted as Industry, Project Types, Services and Main Project Stakeholders, and the consultant enters the following vector value:

T = [Finance, Video conferencing system, Project Management, Ministry of Finance]^T.

The case retrieval interface is shown in Fig. 6.

Figure 6. User query interface.

Also, his preferences for these keywords are defined successively as important, very important, very important and unimportant, respectively, as shown in Fig. 7. In this way, the scores can be described as (4, 5, 5, 2), and we get W = [1/4, 5/16, 5/16, 1/8]^T with (3).

Figure 7. Weights determination interface.

According to the domain ontology above and the user's information, the retrieval vector can be expressed more precisely with semantic and pragmatic expansion, which is shown in detail in Table I. In particular, the Industry dimension "Finance" is interpreted as government rather than the financial industry, because the consultant's information shows that he is responsible for e-government services and the major project stakeholder is the Ministry of Finance.

TABLE I. RETRIEVAL FEATURE VECTOR AND SEMANTIC EXPANSION INFORMATION

Feature (t_ij)            | x_j | w_j  | Vector Value              | Synonymy                           | Pragmatics                              | Hyponymy
Industry                  | 4   | 1/4  | Finance                   |                                    | Government (Y); Financial Industry (N)  | Government
Project Types             | 5   | 5/16 | Video conferencing system | Video session system               |                                         | Multi-media
Services                  | 5   | 5/16 | Project Management        | PM; Supervisor                     |                                         | Management
Main Project Stakeholders | 2   | 1/8  | Ministry of Finance       | Treasury Department; Finance Bureau |                                        | Person; Organization

In this paper we set α = 0.6, β = 0.4 and the similarity threshold λ = 0.5; the consultant's retrieval vector is then C* = W^T·I = (1/4, 5/16, 5/16, 1/8)^T. Computing with the retrieval algorithm and the expansion vector shown above, the cases that satisfy SIM(C_i, C*) ≥ λ are delivered as in Fig. 8.

Figure 8. The matching cases and similarities.

Take the optimal matching case 7 as an example. The computation is as follows:

T_TI = (1/4, 5/16, 5/16, 1/8)^T,  SIM_TI(C_7, C*) = 1;
T_IN = (1, 4, 10, 1),  D_7 = (0, 1/3, 1, 0),  SIM_IN(C_7, C*) = 0.7559;
SIM(C_7, C*) = 0.6 × 1 + 0.4 × 0.7559 = 0.9024.

The traditional way of information retrieval is based on keywords, so the computer cannot understand the user's potential semantic and personalized query demand. Therefore, we conducted the semantic and pragmatic retrieval research based on the domain ontology and the user's information.
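As a quick arithmetic check of the values reported for the optimal matching case 7, the following lines (a Python illustration, not the prototype code) reproduce the similarity:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

c_star = [1/4, 5/16, 5/16, 1/8]     # query vector from Eq. (3) and Fig. 7
d7 = [0, 1/3, 1, 0]                 # normalized description of case 7
sim_ti = 1.0                        # the title part matches the query exactly
sim_in = cosine(d7, c_star)         # = 0.7559...
print(round(0.6 * sim_ti + 0.4 * sim_in, 4))   # 0.9024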

After the analysis, the semantic and pragmatic retrieval achieves the best precision and recall among the traditional retrieval method, the semantic method and the semantic-and-pragmatic method, as shown in Table II.

TABLE II. COMPARISON OF THE RETRIEVAL METHODS

Retrieval method                | Retrieval results | Matching cases | Actual case number | Precision | Recall
Traditional searching           | 3                 | 2              | 8                  | 0.6667    | 0.375
Semantic retrieval              | 5                 | 4              | 8                  | 0.8       | 0.5
Semantic and pragmatic retrieval| 8                 | 7              | 8                  | 0.875     | 0.875

VI. CONCLUSION

Dynamic knowledge repository construction is currently a hot issue in the field of knowledge engineering. This paper focuses on a construction method for a knowledge repository based on ontology and CBR. An integrated framework is put forward and the operation processes of the dynamic knowledge repository system are analyzed. With the construction of the ontology, the user's query can be expanded with more semantic information and the case database can be formed more precisely. A retrieval algorithm is designed and the dynamic mechanism of the knowledge management is illustrated. Moreover, taking a consulting company as the case study background, the retrieval process is verified, which demonstrates the high accuracy and completeness of the retrieval algorithm.
As the initial case samples are small and centralized, further work should be done to conduct quantitative analysis based on case samples of larger size and wider distribution.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 71102111 and Beijing Institute of Technology under Grant 3210050320908.

REFERENCES

[1] Annette M. Mills, Trevor A. Smith. Knowledge management and organizational performance: a decomposed view [J]. Journal of Knowledge Management, 2011, Vol. 15, Iss. 1, pp. 156-171.
[2] K. D. Joshi, Lei Chi, Avimanyu Datta, Shu Han. Changing the Competitive Landscape: Continuous Innovation Through IT-Enabled Knowledge Capabilities [J]. Information Systems Research, Vol. 21, No. 3, September 2010, pp. 472-495.
[3] Chou, Shih-Wei. Knowledge creation: absorptive capacity, organizational mechanisms, and knowledge storage/retrieval capabilities [J]. Journal of Information Science, December 2005, vol. 31, pp. 453-465.
[4] Stanford University Knowledge System Laboratory. A translation approach to portable ontology specifications [R]. Stanford: Stanford University Knowledge System Laboratory, 1992.
[5] Norbert Gronau, Frank Laskowski. Using Case-Based Reasoning to Improve Information Retrieval in Knowledge Management Systems [J]. Springer-Verlag GmbH, 2003, 2663, pp. 94-102.
[6] Wimalasuriya, Daya C., Dejing Dou. Ontology-based information extraction: An introduction and a survey of current approaches [J]. Journal of Information Science, June 2010, vol. 36, pp. 306-323.
[7] Wenhuan Lu, Ikeda, Mitsuru. A uniform conceptual model for knowledge management of international copyright law [J]. Journal of Information Science, February 2008, vol. 34, pp. 93-109.
[8] Akbari, Ismail, Fathian, Mohammad. A novel algorithm for ontology matching [J]. Journal of Information Science, June 2010, vol. 36, pp. 324-334.
[9] Chimay J. Anumba, Raja R. A. Issa, Jiayi Pan, Ivan Mutis. Ontology-based information and knowledge management in construction [J]. Journal of Knowledge Management, 2008, Vol. 8, Iss. 3, pp. 218-239.
[10] Liao Liangcai, Qin Wei, Shu Yu. Ontology-based Dynamic Knowledge Management System [J]. Computer Engineering, 2009, 35(16), pp. 256-261.
[11] Gao Huiying, Yan Zhijun. CBR based Multi-agent System Model for Case Retrieval of Information Systems [J]. Computer Engineering and Design, 2008, 29(5), pp. 1226-1228.
[12] Gao Huiying, Zhao Jinghua. Ontology-based Enterprise Content Retrieval Method [J]. Journal of Computers, 2010, 5(2), pp. 314-321.
[13] Huiying Gao, Qian Zhu. Semantic Web based Multi-agent Model for the Web Service Retrieval. Proceedings of the International Symposium on Computer Network and Multimedia Technology, 2009.12, pp. 897-900.

Huiying Gao, Dr., Associate Professor, was born in Shandong Province, China, in 1976. She received her doctoral degree in management science and engineering from Beijing Institute of Technology in 2003. Now she is an associate professor in the School of Management and Economics, Beijing Institute of Technology. She did her Ph.D. research work at the Technical University of Berlin, Germany, from 2002 to 2003, and from 2008 to 2009 she visited the Karlsruhe Institute of Technology, Germany, for half a year. Her current research interests include the theory and method of information systems, content and knowledge management, semantic retrieval, intelligent information systems, etc. Ph: +86 (10) 68918830.

Xiuxiu Chen was born in Shandong Province, China, in 1988. She is a Ph.D. candidate in the School of Management and Economics, Beijing Institute of Technology. Her research work includes information management and knowledge management. Ph: +86 (10) 68918830.

Convexity Conditions for Parameterized Surfaces

Kui Fang
Institute of Information Science & Technology, Hunan Agricultural University, Changsha, P. R. China
Email: fk@hunau.net

Lu-Ming Shen
Science College, Hunan Agricultural University, Changsha, P. R. China
Email: lum_s@126.com

Xiang-Yang Xu
Institute of Computer Science, Changsha University, Changsha, P. R. China
Email: Xiang-Yang Xu@ccsu.com

Jing Song
Institute of Information Science & Technology, Hunan Agricultural University, Changsha, P. R. China
Email: jingsong2004172@126.com

Abstract - Based on a geometrical method, the internal relationships between locally parameterized curves and local parameterized surfaces are analyzed. A necessary and sufficient condition is derived for the local convexity of parameterized surfaces and functional surfaces. A criterion for the local convexity (concavity) of parameterized surfaces is found; the criterion condition for binary-function convex surfaces is also obtained. Finally, the relationship between globally parameterized curves and surfaces is discussed, a necessary condition is presented for the global convexity of parameterized surfaces, and it is proved that locally convex closed parameterized surfaces are also globally convex.

Index Terms - local convexity, global convexity, Gauss curvature, the second fundamental form

I. INTRODUCTION

With the development of CAGD technology, the geometrical morphological analysis of parameterized surfaces has become an important subfield of CAGD; it is the research content of differential geometry in the analysis and study of free-parameter surfaces. The geometrical design theory is the foundation of applications such as the modeling and design of industrial products and computer-aided process planning and analysis. In this paper, the local and global convexity of regular parameterized surfaces is studied.

On the convexity of curves and surfaces, many scholars have obtained results. Convexity problems of general and special parameterized curves, such as Bézier curves and B-spline curves, have been solved by C. Liu and C. R. Trass [1], Dingyuan Liu [2], and B. de Boor [3]. However, the convexity of surfaces remains an interesting topic. W. Xu [4] derived the convexity condition of Bernstein-Bézier polynomial surfaces on a rectangular region, and G. D. Koras and P. D. Kaklis [5] presented several sufficient conditions for the convexity of parameterized tensor-product B-spline surfaces; in particular, many excellent results on the convexity of Bézier surfaces over a triangular region were obtained by the research group led by G. Z. Chang. But those results cannot be used to analyze the convexity of general parameterized surfaces, since they were obtained by particular methods. At present, some scholars are still devoted to the study of the convexity of binary-function surfaces: Lia [7] and K. Fang generalized the determining method for convex functions to binary functional surfaces and derived a necessary and sufficient condition for such surfaces. Dahmen [9] presented convexity conditions for multivariate Bernstein-Bézier polynomials and box splines.

A typical convex surface is a global one, such as an ovum (egg-shaped) surface, whose accurate definition is given in differential geometry; however, there is no algebraic method for determining this convexity. B. Q. Su [12] defined the local convex surface by the Gauss curvature K ≥ 0; Koras and Kaklis [5] defined the local convex surface by the non-negativity (or non-positivity) of the normal curvature K_n, so that the necessary and sufficient condition for local convexity is that the second fundamental form of the surface satisfies Φ2 ≥ 0 (≤ 0). Since both the Gauss curvature and the normal curvature describe only local properties of the surface, the Gauss curvature K ≥ 0 at some point means that, in the vicinity of this point, all normal section lines passing through it bend in the same direction. Non-negativity (or non-positivity) of the normal curvature at some point means that, when the curvatures of the normal section lines passing through this point are not null, their bending directions are the same; on the contrary, for zero curvature the bending directions cannot be determined. Thus, with the Gauss curvature K ≥ 0,


it is not comprehensive to define the local convex surface, It can be discerned that global convex curves are also
and so does for normal curvature. On the other hand, if local convex ones. The following lemma is a
the geometry definition of local convex curve is discriminating theorem for local convex curves.
generalized to convex surface, it will be too rigor. Based Lemma2.1[1] A parameterized curve is local
on the above, we present a reasonable geometrical convex, if and only if its relative curvature r are
definition for local convex surface, based on this unchanged.
definition and with normal curvature, we can connect the For simplicity, we assume that the arc-length
local convexity of parameterized surface with the second parameterized curve of plane parameterized curve is
foundational form of curved surface together, then the : r r ( s) (2.1)
necessary and sufficient condition for local convexity of
parameterized surfaces, which is a algebraic inequality We set the plane which curve lies in is a directional
determined by the second foundational form of curved one, and its direction vector is k . Let be the unit
surface, is derived. For binary functional surface, as a tangent vector, and write N k the major normal
special parameterized one, based on the determining vector of plane curve , then the Frenet Formula of curve
condition of local convexity, it is easy to get the is:
necessary and sufficient expressions of functional convex d ds r ( s )N
surfaces. Meanwhile, determining conditions for local (2.2)
dN ds r ( s )
convexity (concavity) for parameterized surfaces are
established, and a necessary and sufficient condition is where r ( s) is the relative curvature of the plane curve.
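Written in standard notation, taking α(s) for the unit tangent vector of the arc-length parameterized curve and N = k × α for its unit normal (the symbol names are a notational choice for this restatement), the curve (2.1) and the planar Frenet formulas (2.2) read:

\[
\Gamma:\ \mathbf r=\mathbf r(s),\qquad
\frac{d\boldsymbol\alpha}{ds}=\kappa_r(s)\,\mathbf N,\qquad
\frac{d\mathbf N}{ds}=-\kappa_r(s)\,\boldsymbol\alpha ,
\]

where κ_r(s) is the relative (signed) curvature of the plane curve.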
presented for the global convexity of parameterized Theorem2.1 Let be a plane curve in C 2 , and if the
surfaces. Finally, we prove that the local parameterized curvature of at point P (corresponding to parameter t )
closed convex surface is a global convex surface. is r 0( r 0) , then there exists a neighborhood of t ,
and the segments corresponding to which lie in the left
II. Definition of convex curve closed half-plane (right closed half-plane) of the tangent
line at point P .
Definition2.1 A plane parameterized curve : r r(t ) , Proof: For simplicity, we just discuss the arc
a t b is an ordered set in Euclidean plane R2, when its parameterized curve. By the Taylor expansion at P
direction is from t a to t b . (corresponding to parameter s ), we have
If the direction of the curve is anticlockwise, then with 1
the transformation t a b t , the curve direction can be
[r( s s) r( s)] rs (r )s 2 (2.3)
2
changed to the clockwise one, therefore, in this paper, the Where 1 2 N , and lim 0 .
direction of parameterized curve can be assumed as s o

clockwise one. With Frenet Formulae, obtain r , r r N , combing


Definition2.2 For parameterized curve : r r(t ), a with equation(2.3), we have
t b , we call is a regular curve, if r ' (t ) 0 . 1 1
[r( s s) r( s)] s(1 1s) ( r 2 )N(s) 2
In this paper, only the regular curves in R2 are 2 2
considered, and for every point of the curve, the second 2.4
derivative exists. Hence,
Since the positive direction of the tangent vector at the 1
parameterized curve is coincident with the increment of
[r( s s) r( s)] N ( r 2 )(s) 2 2.5
2
parameter t , we define the tangent vector direction as that If the curvature r 0( r 0) , then by (2.5), for s s-
of tangent line. For plane curve, the tangent line divides
mall enough, we have ( r 2 ) 0( 0)
the plane which the curve lies in into two half plane, and
along the tangent direction, the half surface that on the The above formula shows that there is a neighborhood
night of the curve is called the right surface, and the other of t , such that the corresponding curve segment locates
surface is called the left one. The half surface which in the left closed half-plane (right closed half-plane) of
contains the tangent line is called as the closed half-plane. the tangent at P . This finished the proof of theorem 2.1.
From the view of global convexity, K. Wilhelml [11] The following theorem can be easily deduced by
gave the definition of global convex curves as follows: theorem 2.1 and the definition of global convex curve.
Definition2.3 Let P be an arbitrary point on , we call Theorem2.2 Let be a local convex curve in C 2 , and
a global convex curve, if it lies in the right closed half- the curvature of which is r 0( r 0) , then for any P
plane (left closed half-plane) of the tangent line at P . of , there exists a neighborhood such that the curve
Straight lines are special global convex curves. segments corresponding to which locate in the left closed
However, the definition of local convex curve is: half-plane (right closed half-plane) of the tangent line.
Definition2.4 Let P be an arbitrary point on the By theorem 2.2, and the definition of global convex
regular plane curve , if there exists a neighborhood of curve, we have
P , and in which the segment corresponding to is
Theorem2.3 Let be a local convex curve in C 2 , and
located in the right closed half-plane of the tangent line at
point P , then is called a local convex curve. the curvature of which is r 0( r 0) , then for any P


of , lies in the left closed half-plane (right closed For the Gauss curvature in definition 3.3, the author
half-plane) of the tangent line at P . assumed K 0 , rather than K 0 . The following is an
By (2.4), we have example.
1 Let C : r(u) {x(u), y(u),0} be a global convex surface on
[r( s s) r( s)] s(1 1s)
2 plane XY, take C as the traverse, and the unit assistant vector
then, for s 0( 0) small enough, (u) {0,0,1} of C as the direction vector of the straight line
1 generatrix , then we can engender the developable surface
[r ( s s) r ( s)] s(1 1s) 0( 0)
2 : p(u, v) r(u) (u)v (u, v) D
Consequently, we can derive the following theorem: For p0 (u0 , v0 ) D , n 0 is the normal vector of at p 0 , if
Theorem2.4 Let be a local convex curve in C 2 , and p(u, v) p0 (u0 , v0 ) , then
P (corresponding to parameter t ) be any point of ,then
in the neighborhood of P , the segments of
(p(u, v) p(u0 , v0 )) n0 [r(u) r(u0 ) (u)v (u0 )v0 )] n0
corresponding to t t lie on the right or left of the (r(u) r(u0 )) n0
major normal according to t 0 or t 0 . Due to the definition of the global convex curve, we have
(p(u, v) p(u0 , v0 )) n 0( 0)
thus, is global convex.
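The vanishing Gauss curvature of this example can also be checked directly. Writing the ruled surface as p(u,v) = r(u) + δ(u)v with the constant director δ = {0,0,1} (δ is a notational choice for the unit vector of the generatrix), one has

\[
\mathbf p_u=\mathbf r'(u),\quad \mathbf p_v=\boldsymbol\delta,\quad
\mathbf p_{uv}=\mathbf p_{vv}=\mathbf 0
\;\Longrightarrow\;
M=\mathbf n\cdot\mathbf p_{uv}=0,\quad
N=\mathbf n\cdot\mathbf p_{vv}=0,\quad
K=\frac{LN-M^2}{EG-F^2}=0 ,
\]

so the developable surface has K identically zero and yet, as shown above, is globally convex.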
.DETERMINING CONDITION FOR LOCAL CONVEX Above example explains that there exists a surface
SURFACE
(developable surface) whose Gauss curvature k is 0 , but it is
global convex. Then, it is not comprehensive to define a convex
Before discussing the convex surfaces, several conceptions
surface with Gauss curvature K 0 .
are presented. For parameterized surface
If the definition of local convex curve is generalized to the
: r r(u, v) global one directly, i.e., for any point P on the regular , there
we write exists a neighborhood of P , such that for any point of which,
r (u , v) r (u , v) 2 r (u, v) 1
ru rv ruu (0 )0 (r p) n (n r n )(ds) 2
u v uu 2
2 r (u, v) 2 r (u, v) Suppose that is local convex, if n r 0 , then
ruv rvv
uv vv 2 n rds 2 is the principal part of , and its sign is the
ru rv same with ; if n r 0 implies 2 0 , then,
and call n
ru rv 2 0 ( 0 )is the necessary condition of local convex. On
the normal tangent, and the positive direction of normal vector the contrary, if 2 0 implies n r 0 , then the positive or
n the positive side of . Also, we call negative sign is uncertain, consequently, for the local convex
E ru ru F ru rv G rv rv surface, and 2 are not necessary to posse the same positive
the first fundamental form of a surface, and or negative sign. Thus, it is too rigorous to define the local
L n ruu M n ruv N n rvv convex surface with the geometrical method, and also it is hard
the second fundamental one. to find out an algebraic discriminating method.
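In the usual notation, the quantities just introduced, and the matrix K(u,v) used in the criteria below, are

\[
\begin{aligned}
&E=\mathbf r_u\cdot\mathbf r_u,\quad F=\mathbf r_u\cdot\mathbf r_v,\quad G=\mathbf r_v\cdot\mathbf r_v,\qquad
\mathbf n=\frac{\mathbf r_u\times\mathbf r_v}{\lvert\mathbf r_u\times\mathbf r_v\rvert},\\
&L=\mathbf n\cdot\mathbf r_{uu},\quad M=\mathbf n\cdot\mathbf r_{uv},\quad N=\mathbf n\cdot\mathbf r_{vv},\\
&\Phi_1=E\,du^2+2F\,du\,dv+G\,dv^2,\qquad
\Phi_2=L\,du^2+2M\,du\,dv+N\,dv^2,\\
&K(u,v)=\begin{pmatrix}L&M\\ M&N\end{pmatrix},\qquad
K_{\mathrm{Gauss}}=\frac{LN-M^2}{EG-F^2},
\end{aligned}
\]

and, since EG - F^2 > 0 for a regular surface, the Gauss curvature is non-negative exactly when det K(u,v) = LN - M^2 ≥ 0.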
For any point P on , the tangent plane at which is , the Koras& Kaklis[5] defined the local convex surface with
semi space which the normal vector directs to is the upper semi- normal curvature directly.
space, the other one is the lower semi-space. Also, the semi- Definition3.4 Let : r(u, v) , (u , v ) D R be a regular
2

space which contains is called the upper closed half-space. surface, and P be any point of , if along any direction at P ,
From the view of global convexity, K.wilhelm [11] presented
the normal curvatures kn 0( 0) , then is a local convex
the primeval geometrical definition of the convex surface.
Definition3.1 For parameterized surface surface.
In differential geometry terminology, for any P of , the
: r(u, v), (u , v) D R 2
geometric meaning of normal curvature kn 0( 0) is that,
If ru rv 0 , then is called a regular surface.
when the curvatures of normal section lines at P is not null,
Definition3.2 For regular a surface then directions of the bending direction of normal section lines
: r (u, v) , (u, v) D R 2 are all the same. However, when it is null, the bending
If lies totally in the upper closed semi-space (lower closed directions of normal section can not be determined. On the other
semi-space) of the tangent plane at P r(u, v) , then is hand, because the signs of 2 and normal curvature K n are the
global convex. same, we can not ensure that K n and possess the same sign,
Planes are special global convex surfaces. It is difficult to hence, if the local convex surface is defined with normal
discriminate the global convexity of a surface, we can only curvature kn 0( 0) , it will not be consistent with the
prove it with the primeval geometrical definition, but there is no
algebra method till now. However, B. Q. Su[12] presented the geometry definition.
local algebra definition of a convex surface (local convex A reasonable geometric definition for local convex surface is
surface actually) with Gauss curvature. given as follows.
Definition3.5
Definition3.3 Given : r(u, v) , (u, v) D , for P ,
Let : r (u , v) , (u, v) D R be a regular surface, P
2
if Gauss curvature K 0 , then is a convex surface.
be any point of , if for any normal section line which crosses


P , there exists some neighborhood of P , in which the Proof: For a fixed point P(u, v) on , let n be the unit
segment corresponding to the normal section line lies totally in
normal vector of at P , and the positive direction of the
the upper or lower half-space of the tangent space at P , then
tangent plane of at P is determined by n , then is a
is local convex surface.
directed plane. By the Talor expansion, for all normal sections
Obviously, the global convex surface is a local convex one.
Now, we will provide a necessary condition for the local convex which passes through P , we have
surface by algebraic analysis method. 1
Theorem3.1 r ( s ) r ( s ) r s r s 2 s 2
The necessary condition for a regular surface 2!
: r (u, v), (u, v) D R 2 to be local convex is that for any where s s s , lim 0 , then the directed distance
s o
point on , the Gauss curvature K 0 . between and tangent plane is
Proof: Since is a regular surface, the Gauss curvature
1
exists at any point on . Suppose that the Gauss curvature [r ( s ) r ( s )]n r ns 2 ns 2 (3.4)
K p 0 at P , i.e., P is a hyperbolic point, at which there 2!
by differential geometry[10] theory, we have
exists two asymptotic directions, they form two pairs of vertical
angles, and in which the normal section lines at P bend to the r ns 2 r nds 2 Ldu 2 2Mdudv Ndv 2 (3.5)
opposite side of the tangent plane respectively, which is which is the second basic form of a surface.
contradicted to the definition of local convex surface,
Substituting r r N to equation (3.5), we have
therefore, K 0 . This finishes the proof of Theorem 3.1.
Before the discussion of the discriminating condition of
r N nds 2 Ldu 2 2Mdudv Ndv 2 (3.6)
convexity, conception of positive semi-definite matrix is stated. Write f (s ) N n , observing the equation (3.2), for P ,
Definition3.6 Let f ( x1 , x2 , , xn ) XAX , where A is we have f ( s ) N n nn 1 .
real symmetric matrix of order-N, if for any group of real By the properties of continued function, there exists a
number c1 , c2 , , cn that are not all zeros, neighborhood ( s ) of s , such that for any point in
f (c1 , , cn ) 0( 0) , then f (c1 , , cn ) is called positive ( s ) , have 0 f (s ) 1 . Hence by lemma 3.2, for the
(negative) semi-definite, and A is called a positive (negative)
neighborhood ( s ) , the relative curvature

semi-definite matrix. r 0( 0) of
Lemma 3.1[13] A real symmetric matrices A is semi- , if and only if
positive define if and only if all the principal minor larger or
equal to 0. K (u, v) Ldu 2 2Mdudv Ndv 2 0( 0)
Now we fix a point p(u, v) on the regular surface , the is a semi-positive (semi-negative) define matrix.
unit normal vector of on P is n, and write it the In view of theorem 2.2, for the normal section segment

normal sections through P , at P , when r 0( 0) , there exists a neighborhood
: r (s ) r (u(s ), v(s )) (3.1) ( s ) such that in which lies in left closed semi plane
where P r ( s ) , s is the arc length parametric of . (right closed semi-plane) of the tangent line at P , i.e. lies
The unit tangent vector of normal vector is , in in the upper closed semi-plane(lower closed semi-plane) of the
tangent line at P entirely, since P is arbitrary, is a local
particular, the unit tangent vector at P is . We assume the convex surface.
plane which the normal section lies in is a directional plane, and Combining theorem 3.1 and lemma 3.1, we can get the
the directional vector is k n , write N k the following theorem.
Theorem 3.2
principle normal vector of , then N directs to the left semi- A regular surface : r (u, v)(r (u, v) C ( D), D R ) is
2 2

plane of the tangent line, and the principle normal vector of local convex if and only if
at P is L 0, N 0LN-M 2 0 (3.7)
N ( n) ( )n (n ) n (3.2) or
by Lemma 3.1, we have the following lemma. L 0, N 0LN-M 2 0 (3.8)
Lemma 3.2 The second basic form of a surface Corrolary3.1 Let be a regular local convex surface, then
Ldu 2 2Mdudv Ndv 2 0( 0) K (u, v) is semi-positive (semi-negative) define if and only if
if and only if for (u, v) D , for any point of , the tangent vector directs to the concave
L M side (convex side) of .
K (u , v) (3.3) Proof: For the fixed point P(u, v) of , let n be the unit
M N
normal vector of at P , and the positive direction of the tan-
is a semi-positive( semi-negative) define matrix.
gent plane of at P is determined by n (the positive
Theorem 3.1
(The criterion of local convex surface)A regular surface direction of n is relative to the selection of parametric),.by
(3.6) and the proof of theorem 3.2, in the neighborhood of P ,
: r(u, v) (r (u, v) C 2 ( D), D R 2 ) is local convex if and
all the normal sections through P bend to the upper closed
only if for (u, v) D , K (u, v) is a semi-positive(semi
semi-space(lower semi-space) of the tangent plane , hence,
negative) define matrix.


for any point on , normal vector direct to the concave(convex) lies in the lower semi-space of the tangent plane , i.e.
side of the surface. lies totally on some side of the tangent line L , thus is a
For a closed convex surface , since the unit normal vector global convex curve.
n is a continuous vector function, we have
Corollary 3.2
. THE CRITERION CONDITION FOR THE GLOBAL CLOSED
Let be a regular local closed convex surface, then
CONVEX SURFACE
K (u, v) is semi-positive define (semi-negative define) if and
only if for any point on , the normal vector of which direct to Lemma 5.1[10]
the interior (exterior) side. A simple and regular closed curve of a plane is global convex
Let z f ( xy) be a binary function defined on D, and the if and only if the relative curvature r ( s ) keeps the same sign.
surface which it corresponds to is and the parametric form Lemma 5.2
of which is : r( x, y) {x, y, f ( x, y)} , a simple calculation The regular closed curve of a plane is global convex if and
shows that only if it is a local convex one and does not contain double
points.
f xx f xy
L M Proof: Sufficiency: Suppose that the regular closed curve
1 fx2 f y2 1 fx2 f y2 in a plane is local convex, i.e. (s) keeps sign. Since for each
f yy point on the regular curve : r r( s) , there exists a
N
neighborhood, such that in which the vector function r ( s )
1 fx2 f y2
corresponding to the curve is one to one, and there is no double
by theorem 3.2 and corollary 3.1, we have theorem3.3 Binary points on the curve, then the vector function corresponding to
function surface : z f ( x, y) is a convex function if and
the regular curve : r r( s) is one to one, i.e. is a simple
only if
curve. Hence, by lemma 5.1, is a global convex curve, which
f xx 0, f yy 0f xx f yy -f xy 2 0 (3.9) finishes the proof of sufficiency.
And is the lower convex function if and only if Necessity Let P be a double point of the global convex
f xx 0, f yy 0f xx f yy -f xy 2 0 (3.10) curve, and the corresponding two parameters are t1 and t2 , then
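With these quantities, the semi-definiteness criterion of Theorem 3.2 and its specialization to a functional surface z = f(x,y) in Theorem 3.3 can be summarized as follows (taking, as is usual, the positive semi-definite case as the convex one):

\[
L\ge 0,\ \ N\ge 0,\ \ LN-M^2\ge 0
\qquad\text{or}\qquad
L\le 0,\ \ N\le 0,\ \ LN-M^2\ge 0 ,
\]
\[
f_{xx}\ge 0,\ \ f_{yy}\ge 0,\ \ f_{xx}f_{yy}-f_{xy}^2\ge 0
\qquad\text{or}\qquad
f_{xx}\le 0,\ \ f_{yy}\le 0,\ \ f_{xx}f_{yy}-f_{xy}^2\ge 0 .
\]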
corresponding to [t1 , t2 ] is a piece of regular closed curve of
. THE NECESSARY CONDITION FOR THE GLOBAL . On the other hand, if there exists a point Q belongs to
CONVEX SURFACE
but not to , then write O the nearest point from Q to ,and
Firstly, we present the following lemma. the tangent line through O is T , thus OQ T , i.e.
Lemma 4.1 Let M be any point on the regular global convex
surface , be an arbitrary normal section through M on OQ coincides with the principal vector N at O . By theorem
, and lies in , then the tangent plane of any point on 2.4, there exists a point of lies in both sides of the principal
can not coincide with . vector N , and N passes through inevitably. Assume that
Proof; If the tangent plane of at P is also , by the lies in the right semi-plane of the tangent line T , then Q
global convexity of , we can assume that lies in the lower
semi-space of the tangent . Then, we can draw another lies in the left semi plane of the tangent line T , and there exists
points of on both sides of the left and right semi plane of the
normal section at M through some other direction (rather tangent line T which is contradicted with the fact that is
than the direction of the tangent line), and the plane which global convex, hence there does not exist any point of
lies in is , such that and intersects but not coincides outside .
with each other. By theorem 2.4, in the neighborhood of M ,
Simultaneously, if there exists Q on lies in , then the
lies on two sides of the normal line with normal vector n ,
tangent line through Q inevitably passes through which is
then there exists some point on lies in the upper semi-space
of tangent plane of , which is contradicted with the fact contradicted with the global convexity of , thus, there does
that lies in the lower semi-space of the tangent plane . not exist any points of on . By above facts, there does not
Theorem 4.1 exist double point on . This finishes the proof of the lemma.
(The necessary condition for global convex)A regular surface Lemma 5.3
: r(u, v) (r (u, v) C 2 ( D), D R 2 ) is global convex if and Let be some regular local closed convex surface, be a
only if the normal section through any point of the surface is a normal section of at some point, and the plane which lies
global convex curve. in is then the tangent plane at any point of can not be
Proof Let M be any point of the regular convex surface , the plane .
is the normal section of through M , and lies in , Proof If the tangent plane of at P is also , since
then for M by the global convexity of , lies on some divides into two open surfaces, we can write 1 and 2 the
side of the tangent line through M . For any point P on , we
upper semi-open surface and the lower one. Draw the normal
can draw the tangent plane of through P , by lemma 4.1,
section of any direction at P , the parts of on 1 and 2
intersects with , and denote L the intersection of the
planes above. Since L lies in both and , then L is the is written as 1 and 2 respectively. Then, there exists a
tangent line of at P . By global convexity, we can assume neighborhood of P , such that in which the curve segment


corresponds to 1 and 2 in the upper and lower semi space 0 F (P) N n 1


respectively, which is contradicted with the local convexity of then
. 0 (P) / 2
Lemma 5.4[10] Let be the angel between the tangent
x Since the (P) is continuous, then 0 ( P ) / 2 , i.e.
vector and positive direction of axis, then
d (t ) dr 0 F (P) Nn 1 , By equation(3.6), we have
(s)
dt dt r Nnds 2 r F (P)ds 2 Ldu 2 2 Mdudv Ndv 2 0
i.e., is monotone decreasing when (s) 0 , and monotone (3.11)

increasing when (s) 0 .


We assume that : r r(t ), a t b is a piecewise regular
plane curve, that is, there exists parametric series
a a0 a1 an 1 an b
such that the curves is regular in the each intervals
(a j , a j 1 )( j 0,1, , n 1) , and r ( a j ) is the corner point.
For the piecewise closed plane curve, the rotation number is Figure 1. Closed curve with double point.
defined as [10]
n 1 n 1
1 1
nc
2
[ (a
j 0
j 1 ) (a j )]
2

j 0
j

where j is the exterior angle of on the corner point


r ( a j ) , namely, the direction angel is from the tangent
direction r ( a j ) to r ( a j ) , and j .
Figure 2. Closed curve with double point.
Particularly, the above formula is
1 i.e., for each point on , the relative curvature r 0 , thus
[ (b) (a ) 0 ]
nc
2 is local convex. If there exists some double point on
and when the plan curve contains only one corner, the (Figure 1), since for normal section , there does not exist
tangent line rotates
intersecting curve, are two independent closed curves, and
(b) (a) 2 nc 0 there is only one common point, and there is a hole in the
Lemma 5.5[10] curved surface body which contains in, which is impossible
(The rotation number theorem) If the plane curve is for a closed curve. If there exists a double point on (figure 2),
piecewise regular, simple and closed, then the rotation number
simultaneously, there exist two coinciding points A1 , A 2 on .
is nc 1 ,
Lemma 5.6 Suppose that T1 , T2 is the tangent line corresponding to
Let be a local regular closed convex surface, and be A1 , A2 , since is local convex, there exists a neighborhood
any normal section on through M , then is global convex.
of A1 ( A 2 )such that the locals the convex curve segment
Proof: Since is closed, and so does for . Let P be any
point on and the unit normal vector of at P is n the 1 ( 2 )lies in the left semi plane of T1 ( T2 ).
positive direction of tangent plane of at P is determined
by that of n , then is a directional plane. For simplicity, we
can assume that n directs to the interior of by corollary
3.2, the matrix K (u, v) corresponding to is positive semi-
definite.
Assume that which lies in is a directional one, the
tangent vector of at each point is , and the principal normal
vector is N . Figure 3..Local convex Bezier curve three times the structure.

For any point P on , write (P) the angel between N Two cases will be considered
and n , then 0 (P) , Write For the neighboring of A 2 , if there does not exist the

F (P) Nn neighborhood of A1 of 1 lies in the left semi space of the


If there exits P * on , such that (P*) / 2 , then tangent T2 , then there are points of lie in both the left and
F (P*) 0 i.e., the tangent plane of at P * coincides right semi space of T2 .
with the plane which lie in, which is contradicted with For the neighboring of A 2 , if there exists the
neighborhood of A1 of 1 lies in the left semi space of the
lemma 5.4, thus, (P) .
2 tangent T2 , then T2 is the tangent through A1 , and T1 , T2
Suppose that there exists P on such that


coincide each other, on the other hand, by r 0 , T1 , T2 to the interior of , by corollary 3.2, the matrix K (u, v)
possess the same direction. corresponding to is negative semi-definite.
For case, we take T1 as x axis, choose two points P1 , P2 Now, we prove that lies in the lower half-space. Let M
be any point on , then for M , we can draw a plane through
on 1 , and the corresponding tangent line is T *1 , T *2
n and M which intersects with , and the intersection is
respectively, the angel between T *1 , T1 , T *2 and the positive the normal section of at M , by lemma 5.4, is a global
direction of x is 1 , 0, 2 respectively, which are all monotone closed convex curve. Assume that the plane which lies in
is a directional one, for any P on , is the corresponding
increasing. Figure 5.3 is a local convex cubic Bzier curve * ,
tangent vector, and the tangent line is T , then for , we can
the start point and the end one of which is P1 , P2 respectively,
choose unit direction vector k such that the principal normal
and is tangent to T *1 , T *2 , then for , by substituting curve
vector of at M is N n . Assume that the principal normal
P1 A1 P2 with * , we can get a closed curve , which is a vector of at P is N , the unit normal of along at P
regular, simple closed curve. By the construction of * , we is n , by the proof process of lemma 5.4, the angel between N
know that for * , the relative curvature *r 0 , and so does and n is, 0 (P) / 2 i.e. 0 F (P) N n 1 , by
equation(3.6), we have
for , by lemma 2.1, is global convex, which is
contradicted with case . r Nnds 2 r F (P)ds 2 Ldu 2 2 Mdudv Ndv 2 0
For case , by lemma 5.5, the rotation numbers of is then for any point on , the relative curvature r 0 , by
nc 1 , i.e., as T *1 rotates a period along the parametric
theorem 2.3, for any P , lies in the closed right semi-
direction of , T *1 rotates 2 also. Also, as T *1 rotates to plane of the tangent line at P , i.e. lies in the closed lower
T *2 along 1 and P1 A1 P2 the rotation angel is the same, semi-space of the tangent plane of at P , as a consequence,
M lies in the closed lower semi-space of the tangent plane .
then as T *1 rotates a period along the parametric direction of
This finishes the proof of the theorem.
so does for T *1 .
Along the parametric direction, if we divide into two . CONCLUSION
regular and simple closed curves at A1 ( A 2 ), i.e.,
A1 P2 A2 (denoted by 3 ) and A2 P1 A1 (denoted by 4 ), then by For local convexity of any point on the parametric surface, if
it is defined by the local convexity of each normal section
Lemma 5.5, as the tangent line T1 of 3 rotates to T2 along passing through this point, then the local convexity of
the parametric direction, T1 rotates to 1 2 0 , where 0 is parametric surface and the second basic quantity of the surface
can be connected, and the algebra expression of the necessary
the exterior of the corner r( A1 ) of 1 , namely, the direction and sufficient condition for the determination of local convexity
surface is derived, which solves the determination method of the
angel from tangent direction r ( A1 ) to r ( A1 ) , and
local convexity of parametric surface well. In this paper, the
0 . necessary and sufficient condition [7] of discriminate the
Similarly, as the tangent line T2 of 4 rotates to T1 along convexity of binary function is a special case. Also, the
discriminate condition based on local convexity surface can be
the parametric direction, the rotation angel of T1 is 1 2 0 . applied for determining the convexity and concavity of a
Simultaneously, as the tangent line T2 of 4 rotates to T1 parametric surface easily. Also, we show that, for the local
parametric closed convex surface, the local convexity is
along the parametric direction again the rotation angel of T1 consistent with the global convexity, which means that the local
is also parametric closed convex surface is just a global one. Since the
1 2 0 determination of the global convexity of a parametric surface is
very difficult, in this paper, only a necessary condition is
Since is local convex, by Lemma5.4, the angel between presented for the determination of such surface, and the algebra
the tangent vector and positive direction of x axis is method is still unknown, which is a subject which is should be
monotone increasing, and as T1 rotates to T1 along the further researched.

parametric direction of , the total rotation angel of T1 is


ACKNOWLEDGMENT
1 2 4 , which is contradicted with case , thus,
This work is supported by NSFC(No.20206033) and
is a global convex curve. Scientific Key Planning Project of Hunan Province(No 2010J05)
Theorem 5.1 The regular closed surface global convex if
and only if is local convex. REFERENCES
ProofThe necessity can be deduced by the definition of
[1] C.Liu and C.R.Trass, On convexity of planar curves and
global convex surface.
its application in CAGD, Computer aided geometric
Sufficiency: Let M be a point on , n is the unit normal design, vol.14, pp. 653-669, December 1997.
vector of at M , and the positive direction of the tangent [2] D. Y. Liu, Convexity theorem on planar n-th Bzier
plane of at M then the tangent plane is a directional curves, Chinese annals of mathematics, vol.3, pp.45-55,
one. For simplicity, we can assume the normal vector n directs April 1982.
[3] B.De Boor, A practical guide to spline, Springer-
Verlag,.Berlin, 1988.


[4] W. Xu, "Convexity of Bernstein-Bézier polynomial surface on the rectangular region," Applied Mathematics, vol.13, August 1990.
[5] G. D. Koras and P. D. Kaklis, "Convexity conditions for parametric tensor-product B-spline surfaces," Advances in Computational Mathematics, vol.10, pp.291-309, October 1999.
[6] G. J. Wang, Computer Aided Geometric Design, Higher Education, 2001.
[7] P. Lia, "Convexity preserving interpolation," Computer Aided Geometric Design, vol.16, pp.127-147, June 1999.
[8] K. Fang and X. H. Zhu, "Convexity condition of bivariate convex function," Pure and Applied Mathematics, vol.24, pp.97-101, April 2008.
[9] W. Dahmen and C. A. Micchelli, "Convexity of multivariate Bernstein polynomial and box spline surface," Studia Sci. Math. Hungar., vol.23, pp.265-287, March 1998.
[10] X. M. Mei and J. Z. Huang, Differential Geometry, Higher Education, 1998.
[11] K. Wilhelm, A Course in Differential Geometry, Springer-Verlag, 1978.
[12] B. Q. Su, Five Lectures in Differential Geometry, Shanghai Science and Technology Press, 1979.
[13] Mathematics Department of Peking University, Higher Algebra, Higher Education, 2003.
[14] K. Fang, The Shape Preserving Interpolation Theory and Algorithm in Computer Aided Geometrical Design, Hunan People, Changsha, 2003.

Kui Fang was born in 1963. He received the Ph.D. in computer science and technology from the National University of Defense Technology of China in 2000, and the M.S. degree in computational mathematics from Xi'an Jiaotong University of China in 1985. He is now a professor of computer science and technology at Hunan Agricultural University. His major research interests include intelligent information processing, computer graphics, and parallel computing.

Lu-Ming Shen was born in 1973. He received the M.S. degree in applied mathematics from Wuhan University, China, in 2001. He is currently working towards the Ph.D. degree at Huazhong University of Science and Technology. He is now an associate professor of applied mathematics at Hunan Agricultural University. His major research interests include fractal geometry and its applications, and computer graphics.

Xiang-Yang Xu was born in 1963. He received the M.S. degree in operational research theory from Jilin University of China in 1988. He is now a professor of computer science and technology at Changsha University. His major research interests include information security and computer networks.

Jing Song was born in 1989. He is currently working towards the M.S. at Hunan Agricultural University.


Research on Automatic Management Model to


Personal Computer
Yalin Song
Complex Intelligent Network Institute, Henan University, Kaifeng, P. R. China
Computing Center, Henan University, Kaifeng, P. R. China

Xin He
Complex Intelligent Network Institute, Henan University, Kaifeng, P. R. China
Computing Center, Henan University, Kaifeng, P. R. China

Abstract - This paper describes the roles and functions of autonomic computing for the personal computer. Based on the technology of autonomic computing, this paper defines an architectural model for personal computer autonomic computing to solve some of the problems that lead to the crisis of software complexity. Based on this model, one test program with self-monitoring, self-analyzing and self-healing characteristics is implemented. Some key values of the knowledge base for automatic management are proposed in this paper after we discuss and analyze the influence of the utilization rate of the CPU, the usage rate of RAM and the usage rate of network bandwidth on the operating system of a personal computer. The results show that self-management is feasible for the PC, has been used in some application systems, and obtains good results.

Index Terms - autonomic computing, automatic management model, self-monitoring, self-healing

I. INTRODUCTION

The quality of service of information technology must be improved while reducing the total cost of ownership of its operating environment. System deployment failures, hardware and software issues, and human errors can increasingly hamper effective system administration. Human intervention is required to enhance the performance and capacity of the components in an IT system. But, with the development of science and information technology, human intervention leads to many problems. So a new technology, autonomic computing, has been presented. It helps to address complexity by using technology to manage technology.

For complex systems, autonomic computing provides self-managing ability. It accomplishes its functions by taking an appropriate action based on one or more situations that it senses in the environment. The autonomic ability is a control loop that collects details from the system and acts accordingly. It includes self-monitoring, self-healing, and so on. Autonomic computing leads to huge benefits for complex system management. However, existing studies mainly pay attention to servers; the personal computer is ignored.

With the development of electronic and microelectronic technologies, the personal computer has become the main tool in many fields such as routine work, study and daily life. The computer system has become larger and larger and the structure of hardware and software has become more complex, but people do not have enough capability to operate and manage this complex system. So we urgently need a kind of computer system with self-monitoring, self-configuring, self-analysis and self-controlling abilities for when some running programs on it exhibit bugs and errors. Only in this way can the computer become a more useful tool.

In this paper, the autonomic computing concept is introduced in Section 2; Section 3 presents the design and implementation of an architectural model of autonomic computing based on the personal computer; in Section 4, the performance of the autonomic computing model and algorithms is analyzed by experiments; in the last section, the conclusion of this paper is given.

II. AUTONOMIC COMPUTING

Because the data being processed is more complex and the number of accessing devices is larger and larger, the traditional computing method (input data -> execute computation -> output result) no longer gratifies our expectations. Although we have constantly added memory, shown output results in graphic form and accelerated the speed of operation of the basic components of computers, we still think that the important function of the machine is computing and operating. We have not realized the real intention of the people using information, which is not only computing but also hiding complexity and reducing manual intervention.

Autonomic computing is a model of automatic management. The concept comes from the immune system of the human body, from which it gains its name. An autonomic computing system can control the programs and system functions of a computer without bothering users, just as the human body's immune system does. It is an important goal to create a system with self-running and advanced functions whose complexity cannot be perceived by users [1].

IBM Corporation first put forward the concept of autonomic computing in 2001 [2]. The purpose is to create a computer system that fulfills self-management. In this system, the computer can voluntarily monitor its own


running-state and carry out the corresponding treatment autonomic systems and provide some recommendations
operation about different state based manage tactic. It for how these challenges may be met in paper [16].
may reduce system complexity and the cost of manage Brittenham designs one autonomic computing
because manager was free from the complex task for PC. framework for IT service management. It defines a set of
The main idea of autonomic computing is the function best practices to align information technology (IT)
of automatic management with four characters [3]: self- services to business needs by using the IT Infrastructure
controlling, self-configuring, self-optimizing and self- Library[17].This framework helps organizations manage
healing. (1) Self-controlling: the system knows about the IT services using standard design patterns and the
current state and memory capacity and connecting requisite customization. He discuss critical contributions
devices of each element in it. It can detect and forecast that autonomic computing offers to the definition and
and recognize some attack from everywhere to protect implementation of an ITSM architecture and
itself; (2) self- configuring: the system configuration will infrastructure. He first introduces key architectural
be completed automatically and it can adjust some patterns and specifications of autonomic computing as
parameters to keep the system work steadily and they relate to an ITSM logical architecture. Then he
continuously; (3) self-optimizing: the computer system shows how autonomic computing delivers value through
will dispatch some resources automatically in order to a set of ITSM-based case studies that address problem
complete the goal of work normally; (4) self-healing: the determination, impact assessment, and solution
computer system can take the measure of correcting deployment.
according to the tactics to restore the running state of it if Because computing systems have become so complex
there are some problems, which is normal or accident, in that the IT industry recognizes the necessity of
PC. One autonomic computing system should include deliberative methods to make these systems self-
some parts: monitoring, analyzing, planning, executing configuring, self-healing, self-optimizing and self-
and knowledge base, and these parts form a circulatory protecting. Architectures for system self-management,
system that need operate and improve. srivastava explores the planning needs of Autonomiac
Ganek presents an overview of IBMs autonomic computing, its match with existing planning technology
computing initiative in paper [14]. It examines the and its connections with policies and planning for web
genesis of autonomic computing, the industry and services and scientific workflows (grids) inpapter [18].
marketplace drivers, the fundamental characteristics of Then, he shows that planning is an evolutionary next step
autonomic systems, a framework for how systems will for AC systems that use procedural policies today.
evolve to become more self-managing, and the key role In addition, application-layer networks (ALN) are
for open industry standards needed to support autonomic software architectures that allow the provisioning of
behavior in heterogeneous system environments. services requiring a huge amount of resources by
Kephart introduces a unified framework that connecting large numbers of individual computers, e.g.
interrelates three different types of policies that will be Grids and P2P-Networks. Self-organization, like
used in autonomic computing systems: Action, Goal, and proposed by the Autonomic Computing concept, might
Utility Function policies in paper [15]. These policy be the key to controlling these systems. So, by CATNET
framework is based on concepts from artificial project[19], eymann evaluates a decentralized mechanism
intelligence such as states, actions, and rational agents. It for resource allocation in ALN, based on the economic
shows how the framework can be used to support the use paradigm of the Catallaxy. The economic model is based
of all three types of policies within a single autonomic on self-interested maximization of utility and self-
component or system, and use the framework to discuss interested cooperation between software agents, who buy
the relative merits of each type. and sell network services and resources to and from each
System and network security are vital parts of any other.
autonomic computing solution. The ability of a system to The above concept, however, is put foreword to
react consistently and correctly to situations ranging from resolve some complex problems about sever firstly. As to
benign but unusual events to outright attacks is key to the server, the personal computer has its own characteristics,
achievement of the goals of self-protection, self-healing, such as lower hardware requirement, cheaper price, large
and self-optimization. Because they are often built around quantity, wide application range etc. Furthermore, the
the interconnection of elements from different people who operated personal computer almost are
administrative domains, autonomic systems raise ordinary users that did not participate in specialized
additional security challenges, including the training; the hardware and software configuration often
establishment of a trustworthy system identity, change as not well as sever computer that can improve
automatically handling changes in system configuration fault-tolerant through the technology of hardware
and interconnections, and greatly increased configuration redundant. With constantly updating software and
complexity. On the other hand, the techniques of hardware peoples operated personal computer feel the
autonomic computing offer the promise of making more and more difficult in handling and maintaining. In a
systems more secure, by effectively and automatically word, it become an urgent problem how do apply the
enforcing high level security policies. Chess discuss these technology of autonomic computing into the personal
and other security and privacy challenges posed by computer to simplify people's operation and management.


For this reason, in this paper we define an architectural model for personal computer autonomic computing to solve some of the problems which lead to the crisis of software complexity. Based on this model, one program with self-monitoring, self-analyzing and self-healing characteristics is implemented in the .NET framework.

III. DESIGN AND IMPLEMENTATION OF ARCHITECTURAL MODEL OF AUTONOMIC COMPUTING BASED ON PC

This paper implements automatic management for the personal computer, in order to reduce the complexity of managing it, by integrating the technology of autonomic computing with the personal computer. According to IBM's model of autonomic computing, we define an architectural model for the personal computer's autonomic computing, whose main idea is to implement one model with the function of automatic management. It has two parts: the self-manager and the managed resource (see Figure 1). The managed resource refers to the hardware and software resources. The self-manager is made up of a monitoring module, an analyzing module, a planning module, an executing module, a communication module, etc. They share a knowledge base and form a control loop in order to control the managed resource.

Figure 1. Self-management model (the self-manager's monitor, analysis, plan and execute modules share a knowledge base, communicate with the outside, and act on the managed resource through inner and outer sensors and effectors)

The following are the steps for implementing the automatic management module.

A. Building the shared knowledge base

In order to save tactics, logs and performance information of the system, we built a shared knowledge base that gets information from the monitoring module and is supplemented by users. When some condition is satisfied, the corresponding rule is triggered and leads to a corresponding action.

This paper uses the Extensible Markup Language (XML) to save the shared knowledge base, because XML is simple, is an internationalized standard, and is very convenient to read and write. Here is the format of this file:

<CTRLContent>
  <ID>1</ID>
  <TypeName>CPU</TypeName>
  <UValue>50</UValue>
  <Unit>%</Unit>
  <Duration>3000</Duration>
</CTRLContent>

The node "TypeName" is the type of the monitored quantity; "UValue" is the upper limit value for "TypeName"; "Unit" is the unit of "UValue"; "Duration" is the length of time for which the current value may exceed UValue, in milliseconds.

B. Design of the operation module of the knowledge base

The shared knowledge base needs to be updated constantly to support more efficient management tactics. The paper designed an updating module to update the base. All data gained by the monitoring module and input by users update the knowledge base through this module.

According to the above file format, this paper uses the DataTable class of the .NET framework to read the base file and quickly convert this XML file into an in-memory data table that can be read and written conveniently and quickly. Here is the main code:

// Load the knowledge base from XML into a DataTable
DataTable dtCTRL = new DataTable();
dtCTRL.ReadXml(xmlPath);

// Read data
dtCTRL.Rows[0]["TypeName"];   // the monitored type stored in the knowledge base
dtCTRL.Rows[0]["UValue"];     // the upper limit value for the monitored type
dtCTRL.Rows[0]["Duration"];   // how long the current value may exceed UValue

// Write data
dtCTRL.Rows[0]["UValue"] = value;
dtCTRL.Rows[0]["Duration"] = value;
dtCTRL.WriteXml(xmlPath);

In addition, we also reference the Xml namespace to read and write XML files; its main class is XmlDocument.

C. Design of the monitoring and analyzing module

For various reasons, such as required resources not being available or network connections timing out, some processes running on the computer can become suspended and stop accepting input from the user. The user's control of the computer is affected and the computer's running state becomes unsteady. The entire computer system may even break down as more and more such processes accumulate. Sometimes people notice these suspended processes and kill them through the Windows Task Manager; sometimes nothing can be done on the PC and finally the computer system can only be restarted.
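To make the rule format and the Duration semantics above concrete, the following is a minimal sketch, not taken from the paper, of how one <CTRLContent> entry could be loaded and evaluated against periodic samples; the class and member names are illustrative assumptions.

using System;
using System.Data;

// Illustrative rule object mirroring one <CTRLContent> entry of the knowledge base.
class MonitorRule
{
    public string TypeName;          // e.g. "CPU"
    public double UValue;            // upper limit, e.g. 50
    public int DurationMs;           // how long the limit may be exceeded, e.g. 3000
    private DateTime? exceededSince; // when the current violation started

    // Load one rule from a knowledge-base file in the <CTRLContent> format above.
    public static MonitorRule Load(string xmlPath, int index)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(xmlPath);                          // one row per <CTRLContent> entry
        DataRow row = ds.Tables["CTRLContent"].Rows[index];
        return new MonitorRule
        {
            TypeName   = (string)row["TypeName"],
            UValue     = Convert.ToDouble(row["UValue"]),
            DurationMs = Convert.ToInt32(row["Duration"])
        };
    }

    // True once the sampled value has stayed above UValue for longer than Duration.
    public bool IsViolated(double sample, DateTime now)
    {
        if (sample <= UValue) { exceededSince = null; return false; }
        if (exceededSince == null) exceededSince = now;
        return (now - exceededSince.Value).TotalMilliseconds >= DurationMs;
    }
}

The monitoring module described next can call such a check with every new sample and hand violations to the planning and executing modules.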


The paper designed the monitoring module and the analyzing module based on the model of automatic management. The program collects performance information of the system, including the occupancy and utilization rate of the CPU, the utilization rate of memory, the network connection state, etc., through the sensors it controls. This information forms the main part of the shared knowledge base and is analyzed by the operation module of the base in order to check whether the current behaviour of the computer system is correct. If abnormal behaviour of the PC is detected, the computer system will try its best to do something to avoid rebooting the machine.

The .NET framework provides a powerful operating class, Process, which can easily obtain useful information such as the occupancy rate of the CPU and the utilization of memory [4]. The main code is:

// Enumerate all processes and read their resource usage
Process[] procs = Process.GetProcesses();   // get all processes opened by Windows
long ram;
TimeSpan tCPU;
DateTime dtStart;
bool bResFlag;
foreach (Process myProc in procs)
{
    ram = myProc.WorkingSet64;          // physical memory used by the process (bytes)
    tCPU = myProc.TotalProcessorTime;   // total CPU time consumed by the process
    dtStart = myProc.StartTime;         // the time at which this process started
    bResFlag = myProc.Responding;       // responding flag
}

We can get the occupancy rate of the CPU from the current time, the value of tCPU and the value of dtStart (that is, the CPU time consumed divided by the wall-clock time elapsed since the process started).

D. Design of the planning and executing module

When some problem is found, this module analyzes the trouble and deals with it automatically based on the monitoring results and the strategies in the knowledge base. Finally, the system calls the effectors to carry out the action.

E. Design of the sensor and effector module

The main function of a sensor is to collect the managed resource's information, and it has two parts: the inner sensor and the outer sensor [5]. The inner sensor gains information on the inner resources of the computer, while the outer sensor module can obtain information that is not on the PC through the network environment, so it can manage other computers and achieve the goal of extending the self-management model. The effector acts on the managed resource and is divided into two parts: the inner effector and the outer effector. The inner effector deals with the internal resources of the computer, while the outer effector deals with those of other machines through network connections. In addition, we also use peer-to-peer technology to search for resources and to implement remote cooperation between computers, which can extend the function of the sensors and effectors.

In this paper, the implemented effector terminates some processes in order to restore the normal state of the computer, which shows the function of self-healing. We may use the Kill() method of the Process class to achieve this goal in the .NET environment. The main step is:

// Terminate the offending process by name
Process[] myprocess = Process.GetProcessesByName(processName);
foreach (Process p in myprocess)
{
    p.Kill();
}

IV. PERFORMANCE ANALYSIS

The methods above were realized by programming in VC# 2005 and run on a computer whose processor is a P4 2.4 GHz with 1 GB of memory, running the Windows XP operating system. The implemented program includes all parts of the self-manager; it monitors the whole system through the monitoring module and uses the analyzing module to judge whether some process has been suspended. If suspended processes are found, the program terminates them.

Figure 2. Self-manager illustrative diagram (the monitor collects the results of programs 1..N and terminates the processes that violate the rules)

All the parts above depend on one another and form a control loop that uses the rules as judgment conditions, such as the factor used to judge suspension, the interval for sending messages, etc. All this information is saved in the knowledge base. These rules can of course be updated through the communication interface. The experiment is illustrated in Figure 2.

A. Implementation Process

To test this model we developed a small program, called "test.exe". It has a main form user interface and a command button. When the button is clicked, the program enters an endless loop. Furthermore, the program exhibits a serious bug, a memory leak, if it is not terminated. The program cannot be operated during this time and the situation may even lead to a breakdown (see Figure 3a and Figure 3b).


(a) Test program interface
(b) Test program run state
Figure 3. Test program state

(1) Build the knowledge base file. We create a new XML file to store the knowledge base, which includes control rules for CPU, RAM and Net. See Figure 4.

<roleBase>
  <CTRLContent>
    <ID>1</ID>
    <TypeName>CPU</TypeName>
    <UValue>50</UValue>
    <Unit>%</Unit>
    <Duration>30000</Duration>
  </CTRLContent>
  <CTRLContent>
    <ID>2</ID>
    <TypeName>RAM</TypeName>
    <UValue>600</UValue>
    <Unit>MB</Unit>
    <Duration>30000</Duration>
  </CTRLContent>
  <CTRLContent>
    <ID>3</ID>
    <TypeName>Net</TypeName>
    <UValue>1000</UValue>
    <Unit>KB</Unit>
    <Duration>30000</Duration>
  </CTRLContent>
</roleBase>

Figure 4. The format of the shared knowledge base

(2) Run the automatic management system selfManager.exe; this system will monitor all processes, see Figure 5.

Figure 5. Self-management system interface

We can find some important information there, such as memory usage, CPU state and so on.

(3) Execute test.exe, then click the TEST button. The program enters the state of an endless loop. Observe the state of "test.exe".

Figure 6. Monitoring result

(4) Terminate "test.exe". test.exe is terminated according to the rule in the knowledge base.

Figure 7. Operation result
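For reference, a minimal sketch of what such a test program's button handler could look like (illustrative code, not the authors' implementation): clicking the button blocks the UI thread in an endless loop and keeps allocating memory, so the process stops responding and its working set grows.

using System;
using System.Collections.Generic;
using System.Windows.Forms;

public class TestForm : Form
{
    private readonly Button testButton;
    private readonly List<byte[]> leak = new List<byte[]>();  // keeps references so memory is never freed

    public TestForm()
    {
        testButton = new Button();
        testButton.Text = "TEST";
        testButton.Click += OnTestClick;
        Controls.Add(testButton);
    }

    private void OnTestClick(object sender, EventArgs e)
    {
        while (true)                          // endless loop on the UI thread: the form stops responding
        {
            leak.Add(new byte[1024 * 1024]);  // allocate 1 MB per iteration: memory usage climbs steadily
        }
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new TestForm());
    }
}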


B. Control Threshold Analysis

It is well known that the "not responding" phenomenon of a running program occasionally appears in the Windows operating system. Based on a study of the design principles of operating systems and the rules of job scheduling, we find that the "not responding" problem of a running program has several causes: 1) one running program occupies the CPU for a long time; 2) one running program uses a large amount of storage space; 3) one running program takes up a large share of the network bandwidth; 4) a complicated combination of these circumstances. If the problem program is not ended in time, the operating system may suffer serious consequences such as memory leaks and system crashes, the computer cannot be operated during this period, and it may even break down.

We tested and analyzed the various factors and drew conclusions from the following charts.

Figure 8. CPU utilization and response ratio

From this chart we found that lock-up occurrence probabilities greatly increase as the CPU utilization rate continuously rises. In particular, the "not responding" problem appears when the CPU utilization rate reaches or exceeds 50%.

Figure 9. Memory utilization and response ratio

Although part of the hard disk can be set aside as virtual memory in Microsoft Windows, the response rate of the computer declines sharply when the usage of physical memory exceeds a certain limit. According to the test results, performance starts to slow as the utilization rate of RAM reaches 60%, and the "not responding" problem appears when it exceeds 80%. From the chart we can see that the response ratio of the PC greatly declines as the usage rate of memory continuously rises.

Figure 10. Network bandwidth utilization and response ratio

A large share of the network bandwidth is occupied when a PC is attacked by intruders and worms. This causes problems such as being unable to surf the Internet, slow response, and breakdown. From this chart we found that the response ratio of the PC greatly declines as more and more network bandwidth is occupied.

C. Choose Keys of the Knowledge Base

The influence of the CPU utilization rate, the memory usage rate and the network bandwidth usage rate on the operating system of a PC was discussed and analyzed in this paper, and the key values of the knowledge base were provided. The main reason for a slow response ratio is CPU utilization; memory utilization and network bandwidth utilization only play a secondary or supplementary role. So 50 percent is selected as the CPU utilization rate at which the response speed is considered very slow, and if this lasts for more than 30 s the problem program is stopped. The key value for memory is about 60% of physical memory, which is 600 MB in this paper, and the corresponding duration is also 30 s.

The automatic management system can quickly handle these errors based on the key values of the knowledge base to ensure a normal system environment.
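As an illustration of how such a rule could be enforced, the following is a minimal sketch. It is written in Java (unlike the paper's .NET implementation), the class and method names are hypothetical, and only the threshold and duration values come from the knowledge base above; the monitor terminates a process only after its metric has stayed above the rule's threshold for the whole configured duration.

import java.util.Map;

// Hypothetical sketch (not the paper's implementation): one rule from the
// XML knowledge base and the check a monitor could apply to a process.
class ControlRule {
    final String type;      // "CPU", "RAM" or "Net"
    final double threshold; // e.g. 50 (%), 600 (MB), 1000 (KB)
    final long durationMs;  // e.g. 30000

    ControlRule(String type, double threshold, long durationMs) {
        this.type = type;
        this.threshold = threshold;
        this.durationMs = durationMs;
    }
}

class ThresholdMonitor {
    // firstExceeded[pid] = time at which the metric first went over the threshold
    private final Map<Integer, Long> firstExceeded = new java.util.HashMap<>();

    /** Returns true when the process has stayed above the rule's threshold
     *  for the whole duration and should therefore be terminated. */
    boolean shouldTerminate(int pid, double metricValue, ControlRule rule, long nowMs) {
        if (metricValue < rule.threshold) {
            firstExceeded.remove(pid);      // back to normal: reset the timer
            return false;
        }
        Long since = firstExceeded.putIfAbsent(pid, nowMs);
        return since != null && nowMs - since >= rule.durationMs;
    }
}

public class RuleCheckDemo {
    public static void main(String[] args) {
        ControlRule cpuRule = new ControlRule("CPU", 50, 30_000);
        ThresholdMonitor monitor = new ThresholdMonitor();
        // Simulated samples of a runaway process: 70% CPU held for 35 seconds.
        boolean kill = false;
        for (long t = 0; t <= 35_000; t += 5_000) {
            kill = monitor.shouldTerminate(1234, 70.0, cpuRule, t);
        }
        System.out.println("terminate test.exe? " + kill);   // true once 30 s have elapsed
    }
}

The timer reset on a normal sample mirrors the intent of the duration field: a brief spike does not trigger termination, only a sustained violation does.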

V. CONCLUSION

The paper describes the roles and functions of autonomic computing for the personal computer, and defines an architectural model for PC-based autonomic


computing after analyzing the requirements of the personal computer field. According to this model, a program with self-monitoring, self-analyzing and self-healing characteristics is implemented in the .NET framework. It shows that automatic management is feasible for a personal computer. However, this paper is only concerned with the automatic management of a single personal computer; the problem of automatic management between different computers requires further research.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National High Technology Research and Development Program of China (No. 2008AA01Z410), the National Natural Science Foundation of China (No. 60873071) and the science and technology development project of Shaanxi province (No. 2007K04-05).

REFERENCES

[1] Kephart J, Chess D. The Vision of Autonomic Computing [R]. IEEE Computer Society, January 2003: 41-50.
[2] IBM Corporation. An architectural blueprint for autonomic computing. http://www-03.ibm.com/autonomic. 2003.
[3] Steve R. White, James E. Hanson, Ian Whalley. An Architectural Approach to Autonomic Computing [C]. Proceedings of the International Conference on Autonomic Computing (ICAC04), 2004.
[4] Christian Nagel, Bill Evjen, Jay Glynn. Professional C#, 4th ed. [M]. 2006.
[5] Roy Sterritt, David F. Bantz. PAC-MEN: Personal Autonomic Computing Monitoring Environment [C]. Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA04), 2004.
[6] Cai Ming, Yu Wei. An improved method of resource location based on Peer-to-Peer [J]. Control and Automation, 2006, 3-3: 108-109.
[7] Neroni, Massimo. Automatic management system for instrumentation maintenance. Annual International Conference of the IEEE Engineering in Medicine and Biology - Proceedings, v 11 pt 5, p 1595, Nov 1989.
[8] Shivers, Olin. Automatic management of operating-system resources. ACM SIGPLAN International Conference on Functional Programming (ICFP), p 274-279, 1997.
[9] Cytron, Ron. Automatic management of programmable caches. The International Conference on Parallel Processing, v 2, p 229-238, 1988.
[10] Wang, Zhi-Li. Test result automatic judgment in network management interface testing. Beijing Youdian Daxue Xuebao / Journal of Beijing University of Posts and Telecommunications.
[11] Jo, Sun-Moon. Access control technique for XML documents. The 2005 International Conference on Internet Computing.
[12] Li Qiang, Hao Qin-Fen, Xiao Li-Min. An Integrated Approach to Automatic Management of Virtualized Resources in Cloud Environments. The Computer Journal, 2011, 54(6): 905-919.
[13] Autonomic computing white paper. An architectural blueprint for autonomic computing, Third Edition, 2005.
[14] A. G. Ganek, T. A. Corbi. The dawning of the autonomic computing era [J]. IBM Systems Journal, Vol 42, No 1, 2003: 5-18.
[15] Jeffrey O. Kephart, William E. Walsh. An Artificial Intelligence Perspective on Autonomic Computing Policies [C]. Proceedings of the Fifth IEEE International Workshop on Policies for Distributed Systems and Networks, 2004.
[16] D. M. Chess, C. C. Palmer, S. R. White. Security in an autonomic computing environment [J]. IBM Systems Journal, Vol 42, No 1, 2003.
[17] P. Brittenham, R. R. Cutlip, C. Draper, B. A. Miller, S. Choudhary, M. Perazolo. IT service management architecture and autonomic computing [J]. IBM Systems Journal, Vol 46, No 3, 2007.
[18] Biplav Srivastava, Subbarao Kambhampati. The case for automated planning in autonomic computing [C]. ICAC 2005.
[19] T. Eymann, M. Reinicke, O. Ardaiz, P. Artigas, F. Freitag, L. Navarro. Self-Organizing Resource Allocation for Autonomic Networks [C]. Proceedings of the 14th International Workshop on Database and Expert Systems Applications, 2003.

Yalin Song was born in Henan, China, in 1977. He received the master degree of science in computer science and information technology from Yunnan Normal University, China. Currently, his research interests include autonomic computing and image processing.

Xin He was born in Henan, China, in 1974. He received the PhD degree of engineering in computer science and technology from Xi'an Jiaotong University, China, in 2011. He is an assistant professor and master supervisor. Currently, his research interests include mobile computing, autonomic computing and dynamic trust management.


Study and Application of an Improved Clustering Algorithm
Lijuan Zhou
Capital Normal University, Information Engineering College, Beijing, 100048, China
Email: zhoulijuan87@gmail.com

Yuyan Chen and Shuang Li


Capital Normal University, Information Engineering College, Beijing, 100048, China
Email: cyy5112360@163.com, lishuang924@163.com

Abstract—This paper, combined with the characteristics of the early warning about students' grades, presents an optimization algorithm in order to overcome the defect that the random selection of the initial cluster centers causes major volatility in the clustering results. It has been integrated into the open source WEKA platform. The optimized algorithm not only guarantees the accuracy of the original algorithm, but also improves its stability.

Index Terms—data mining, cluster analysis

I. INTRODUCTION

As the data volume of databases increases constantly, in the process of data mining [1, 2] a single mining run takes longer and more rules are mined out, so the user finally faces a mass of rules. Generally, users are not interested in the potential rules of the overall data, but in some implicit ones. When a general algorithm mines the total data, the mining time increases and the rules in which the user is interested will hardly be found among all the mined rules; some rules may not be mined out at all because of the "dilution" of the entire data. In this way, the efficiency drops and useful knowledge cannot be obtained. Therefore, before mining the potential rules, the data area needs to be narrowed according to the user's interests. In the practical application of students' grade early warning [3, 4], cluster analysis [5] is used in the data pretreatment stage: first cluster the students' grades, then narrow the data area, and finally carry out correlation analysis on the specific data according to the user's interests. This correlation analysis narrows the range of the data set dramatically, which improves the mining efficiency. In other words, this approach combines two mining methods effectively and performs association rule mining on the basis of clustering. The initial cluster centers of the k-means clustering algorithm are selected randomly, which makes the clustering results volatile. Aiming at this defect, an improved algorithm for selecting the initial cluster centers is put forward. The experimental results show that the improved algorithm increases the stability while guaranteeing the accuracy rate.

II. K-MEANS ALGORITHM

There are four common classes of clustering algorithms: partitioning algorithms, hierarchical algorithms, large database clustering, and clustering for categorical attributes [6]. Among these, in this paper we use one of the most common partitioning algorithms, the k-means algorithm, mainly because k-means is a classical algorithm for clustering problems: it is simple, fast, and can deal with large data efficiently. Therefore, we choose the k-means algorithm to perform cluster analysis on students' grade data.

The k-means algorithm was put forward by J. B. MacQueen in 1967 [7]. It is the most classical clustering algorithm and has been widely used in science, industry and many other areas, where it has had a deep influence. The k-means algorithm belongs to the partitioning algorithms; it is an iterative clustering algorithm. In the iterative process, it keeps moving members among the clusters until the ideal cluster set is obtained, in which the members of a cluster are highly similar while the members of different clusters are highly diverse. For a cluster $K_i = \{t_{i1}, t_{i2}, \dots, t_{im}\}$, define its average as

$$m_i = \frac{1}{m} \sum_{j=1}^{m} t_{ij}. \qquad (1)$$

The k-means algorithm needs the number of expected clusters as an input parameter. Its core idea is: given the expected number of clusters K, divide the N tuples into K clusters so that the members of a cluster are highly similar and the members of different clusters are highly diverse. The cluster average given above is the cluster centroid, so we can calculate the similarity or distance between clusters according to the cluster centroids.
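As a small illustration of the centroid computation in (1), the following hedged Java sketch averages the members of one cluster attribute by attribute; it assumes purely numeric attributes stored in plain double arrays rather than the WEKA data structures used later, and the sample values are illustrative only.

// Minimal sketch of equation (1): the centroid of one cluster is the
// attribute-wise mean of its members (numeric attributes assumed).
public class CentroidSketch {
    static double[] centroid(double[][] members) {
        int m = members.length, d = members[0].length;
        double[] mean = new double[d];
        for (double[] t : members)          // each member t_ij of cluster K_i
            for (int a = 0; a < d; a++)
                mean[a] += t[a];
        for (int a = 0; a < d; a++)
            mean[a] /= m;                   // m_i = (1/m) * sum_j t_ij
        return mean;
    }

    public static void main(String[] args) {
        double[][] cluster = {{60, 86, 52}, {82, 73, 99}, {73, 73, 98}};
        System.out.println(java.util.Arrays.toString(centroid(cluster)));
    }
}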


The K initial cluster centers of the k-means algorithm are allocated randomly, or the first k objects are used directly. Different initial cluster centers lead to different clustering results, and the accuracy also changes. Moreover, using a clustering algorithm to refine the classification of students' grades is the primary step of the students' grade early warning research, so it has a deep influence on the subsequent research. In view of this, aiming at the problem of the initial cluster centers, we put forward an optimized k-means algorithm in this paper.

III. SEEKING INITIAL CLUSTER CENTERS USING THE OPTIMIZED ALGORITHM

There are two important concepts in the optimization algorithm:
Density parameter: centering on object ti, a constant number Pts of objects are contained within radius r. Then r is called the density parameter of ti, and we use Ri to represent it. The bigger Ri is, the lower the density is; otherwise, the regional data density is higher.
Unusual point: an object that is obviously different from the other objects in the data set.
The optimized algorithm for selecting the initial cluster centers of k-means is:
1) Calculate the density parameter Ri of each object of set S to compose a set R. (All objects compose set S.)
2) Find the minimum Rmin of set R; that is, the region around this object has the highest data density. Treat this object as a new initial cluster center. Delete this cluster center and the Pts objects in its range from S, and delete their density parameters from R.
3) Repeat steps 1) and 2) until K initial cluster centers have been found.
The flow chart of the optimized algorithm for selecting the initial cluster centers is shown in figure 1.

Figure 1. The flow chart of the optimized selecting initial cluster centers algorithm

As can be seen, setting the constant Pts is the most essential thing in the optimized k-means algorithm for the initial cluster centers. The range of a density parameter may include an unusual point if Pts is large, which will inevitably affect the final result. On the contrary, if Pts is small, the k initial cluster centers may be too concentrated to reflect the initial distribution of the data. After repeated experiments, the ideal range of Pts is [N/k-5, N/k-1]. If N is large and K is small, Pts should tend toward N/k-5; otherwise it is better for Pts to tend toward N/k-1. After determining the k initial cluster centers, we get the clustering results by applying the k-means algorithm starting from these k cluster centers.

In order to examine the effectiveness of the optimized initial cluster center selection algorithm, we experimented with students' grades. This paper implemented the improved algorithm on WEKA 3.6.0. The source code of WEKA is open; we can view and analyze the source code in the package weka.clusterers after unzipping the weka-src.jar file in the installation directory into Eclipse. This paper improved SimpleKMeans on the platform


according to the processes in figure 1. The structure of SimpleKMeans on the platform is shown in figure 2.

Figure 2. Structure of SimpleKMeans in WEKA

The m_ClusterCentroids marked in figure 2 is a member variable that stores the cluster centers. It is randomly assigned in SimpleKMeans. Part of the code is as follows:

Random RandomO = new Random(getSeed());
int instIndex;
HashMap initC = new HashMap();
DecisionTableHashKey hk = null;
Instances initInstances = null;
if (m_PreserveOrder)
    initInstances = new Instances(instances);
else
    initInstances = instances;
for (int j = initInstances.numInstances() - 1; j >= 0; j--) {
    instIndex = RandomO.nextInt(j + 1);
    hk = new DecisionTableHashKey(initInstances.instance(instIndex),
            initInstances.numAttributes(), true);
    if (!initC.containsKey(hk)) {
        m_ClusterCentroids.add(initInstances.instance(instIndex));
        initC.put(hk, null);
    }
    initInstances.swap(j, instIndex);
    if (m_ClusterCentroids.numInstances() == m_NumClusters) {
        break;
    }
}

This is the code that selects the k initial cluster centers, and it is the part this paper improves. The improved code for selecting the initial cluster centers is as follows:

Instances S = new Instances(instances);
Instances initInstances = new Instances(instances);
int k1 = 0;
double[] R = new double[S.numInstances()];            // density parameter of every object
int[][] indexofP = new int[S.numInstances()][S.numInstances()];
        // records the objects in the range of every object's density parameter (by index)
int Pts = P;                                          // the constant Pts, chosen as in Section III
for (; k1 < k; ) {
    getDensity(S, Pts, R, indexofP);  // obtain the density parameter of every object in S for the given Pts
    instIndex = getMinR(R, S);        // index of the minimum of R
    m_ClusterCentroids.add(initInstances.instance(instIndex));
    k1++;
    S.delete(instIndex);              // delete the cluster center object from S
    for (int i = 0; i < Pts; i++)     // delete the Pts objects in the cluster center's zone from S
        S.delete(indexofP[instIndex][i]);
}

The definition of getDensity(Instances S, int Pts, double[] R, int[][] indexofP) is as follows:

void getDensity(Instances S, int Pts, double[] R, int[][] indexofP) {
    for (int i = 0; i < S.numInstances(); i++) {
        double[] distance = new double[S.numInstances()];
        int l = 0;
        for (int j = i + 1; j < S.numInstances(); j++) {
            double dist = m_DistanceFunction.distance(S.instance(i), S.instance(j));
            distance[l] = dist;
            indexofP[i][l++] = j;
        }
        // sort the distances (and the corresponding object indexes) in ascending order
        for (int m = 0; m < l - 1; m++) {
            for (int mm = m + 1; mm < l; mm++) {
                if (distance[m] > distance[mm]) {
                    double temp = distance[m];
                    distance[m] = distance[mm];
                    distance[mm] = temp;
                    int temp1 = indexofP[i][m];
                    indexofP[i][m] = indexofP[i][mm];
                    indexofP[i][mm] = temp1;
                }
            }
        }
        // the density parameter of object i is the distance to its Pts-th nearest object
        R[i] = distance[Pts - 1];
    }
}

The definition of the function getMinR(double[] R, Instances S) is as follows:

int getMinR(double[] R, Instances S) {
    int index = 0;
    double min = R[0];
    for (int i = 1; i < S.numInstances(); i++) {
        if (R[i] < min) {
            min = R[i];
            index = i;
        }
    }
    return index;   // the object with the smallest density parameter
}

The purpose of improving the algorithm is to reduce the volatility with which the randomly selected initial cluster centers affect the clustering results. In the source code line


'Random RandomO = new Random(getSeed());', we can see that the random number is influenced by the parameter 'seed'.
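To make the effect of the seed concrete, the following is a small, hedged sketch of how such a seed experiment could be run against the stock SimpleKMeans; the file name grades.arff is a placeholder for the students' score data, and the resulting proportions depend entirely on the data set, so this is not the authors' exact experimental setup.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of a seed experiment: cluster the same data with different seeds
// and print the resulting cluster sizes, in the spirit of Table I below.
public class SeedExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("grades.arff");   // placeholder file name
        for (int seed : new int[] {5, 10, 20, 30, 100}) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(4);
            km.setSeed(seed);              // the 'seed' that drives the random initial centers
            km.buildClusterer(data);
            int[] sizes = new int[4];
            for (int i = 0; i < data.numInstances(); i++)
                sizes[km.clusterInstance(data.instance(i))]++;
            System.out.printf("seed=%d  cluster sizes: %d %d %d %d%n",
                    seed, sizes[0], sizes[1], sizes[2], sizes[3]);
        }
    }
}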
This paper experimented with this effect. The students' grades are clustered into 4 clusters, and we gave the 'seed' different values. The proportions of the 4 clusters in the clustering result change a lot. Table I shows the proportions obtained with different seeds in SimpleKMeans on the WEKA platform.

TABLE I. PROPORTION RESULT
seed   Cluster 0   Cluster 1   Cluster 2   Cluster 3
5      19%         19%         40%         22%
10     30%         25%         22%         23%
20     11%         39%         5%          45%
30     25%         23%         15%         37%
100    37%         21%         21%         21%

From the table we can see that the accuracy is higher when the seed is 5. Then we experimented with the improved algorithm. The small range of the constant Pts ensures the stability of the algorithm; at the same time, different Pts values choose different initial cluster centers. In this way, we avoid a local optimal solution and ensure the accuracy of the algorithm. The students represented by the initial cluster centers selected by the improved algorithm are models of the 4 levels of students classified as "A", "B", "C" and "D" by their grades. This shows that the improved algorithm not only reduces the volatility of the old algorithm but also ensures the stability of the result.

IV. EXPERIMENT RESEARCH

A. Data preparation
The data come from the 4-year grades of the students in a normal class. After cleaning and converting the data sources, we get the data for cluster analysis as Table II shows:

TABLE II. STUDENT SCORE DATABASE OF CLUSTER ANALYSIS
id   english1   english2   C     linear algebra   military training
1    60         86         52    22               2
2    82         73         99    93               3
3    73         73         98    88               1
4    73         69         88    82               2
5    64         70         72    66               2
6    68         57         81    67               2
7    70         68         89    81               2
8    65         64         86    84               2
9    71         64         83    77               2
10   69         67         100   71               2

B. Improved algorithm analysis with 4 clusters
By clustering analysis, we get the clustering result shown in figure 3.

Figure 3. Clustering results

TABLE III. NUMBER AND PROPORTION OF EXAMPLES OF 4 CLUSTERING CLUSTERS
Cluster    Instances   Percentage
cluster0   30          17%
cluster1   65          37%
cluster2   40          23%
cluster3   40          23%

We do a statistical analysis of the 4 cluster centers in each course and observe the distribution of the maximum and minimum values among the 4 cluster centers.

TABLE IV. DISTRIBUTION OF MAXIMUM AND MINIMUM VALUES OF THE 4 CLUSTER CENTERS
Cluster    maximum   minimum
cluster0   20        1
cluster1   21        0
cluster2   0         39
cluster3   1         2

It is easy to see from Table III and Table IV that the grades of cluster0 and cluster1 are better, cluster2's are worse and cluster3's are in the middle.
Then we use a graph to observe the grade distribution of the 4 clusters in each course. The cluster result is in the


last of the properties. Click it to view the statistics and the histogram. The histograms from left to right stand for cluster0, cluster1, cluster2 and cluster3, as figure 4 shows.

Figure 4. Distribution of the course "College English level 1"

Select "english1" to analyze the distribution of the 4 clusters in the course "College English level 1", as figure 5 shows.

Figure 5. Distribution of the course "College English level 1"

We can see from the histogram that the scores of College English Level 1 range from 49 to 85. The scores on the right are higher than those on the left. The overall result is not high because many students are in the middle. The light blue that stands for cluster2 occupies a large proportion of the low scores and a small proportion of the high scores. Both the red that stands for cluster0 and the blue that stands for cluster1 occupy a large proportion of the high scores. The gray that stands for cluster3 occupies different proportions in each part: the largest proportion is in the middle and small proportions are in the other parts. This proves that the cluster center of cluster3 is in the middle.

Next, we analyzed the situation of the computer professional courses. First of all, we chose C programming language, which is the earliest such course the students learnt. The analysis of C programming language is shown in figure 6.

Figure 6. Distribution of the course "C programming language"

The histogram shows that the scores of all the students range from 42 points to 95 points. The light blue that stands for cluster2 occupies almost all of the low scores. The red that stands for cluster0 occupies a large proportion of the high scores. The blue that stands for cluster1 also contains many high-score students. And the grey that stands for cluster3 still occupies a large proportion of the middle scores. Therefore, we suspect that the scores of cluster0 are better than those of cluster1.

Then we chose the representative professional course "Compiler Principle". The analysis of Compiler Principle is shown in figure 7.

Figure 7. Distribution of the course "Compiler Principle"

The histogram shows that the scores of all the students range from 26 points to 96 points, and most of the scores are relatively high because high-score students occupy the largest proportion of all the students. The light blue that stands for cluster2 still occupies a large proportion of the low scores and a small proportion of the high scores. The red that stands for cluster0 occupies almost all of the high scores. The blue that stands for cluster1 and the gray that stands for cluster3 are in the same situation as in figure 6.


Therefore, the result confirms our guess that the scores of cluster0 are better than those of cluster1.
The histograms in figure 8 show the situation of all the courses.

Figure 8. Distribution of all the courses

The courses whose histograms are marked by rectangles are the computer profession classes, while the courses whose histograms are marked by ellipses are pedagogy curriculums; both affect the students' achievement. By analyzing a large number of histograms of different courses, we confirmed our earlier conjecture further: the scores of cluster0 are higher than those of cluster1. The overall situation is that the score of the light blue that stands for cluster2 is the lowest, the scores of the red that stands for cluster0 and the blue that stands for cluster1 are higher, and the score of the gray that stands for cluster3 is in the middle.
We checked the result against the students' actual situation, and it verified our conjecture.

C. Improved algorithm analysis with 5 clusters
Some teachers prefer to classify students into 5 levels, such as "A", "B", "C", "D" and "E", by their grades. So we try to set the number of expected clusters to 5 and see whether it can produce a more accurate result, shown in figure 9. The histograms from left to right stand for cluster0, cluster1, cluster2, cluster3 and cluster4, as figure 10 shows. Table V shows the proportion of the instances in the different clusters.

TABLE V. NUMBER AND PROPORTION OF EXAMPLES OF 5 CLUSTERING CLUSTERS
Cluster    Instances   Percentage
cluster0   30          17%
cluster1   37          21%
cluster2   43          25%
cluster3   40          23%
cluster4   25          14%

Figure 9. The result of 5 clustering clusters

Figure 10. 5 clusters histogram clustering results

Figure 11. Distribution of the course "College English level 1" of 5 clusters

Figures 11, 12 and 13 are the distributions of different courses when the number of clusters is 5. From these figures we can see that the red that stands for cluster1 is worse and occupies a part of the low scores; its


characteristics are not obvious. The focus of the research on students' achievement early warning is the low-score data zone, so we need the data to have distinguishing features and be representative. On the other hand, the grey that stands for cluster3 and the light blue that stands for cluster2 both represent the high block, the blue that stands for cluster0 and the pink that stands for cluster4 are both in the middle, and we cannot distinguish which one is higher.

Figure 12. Distribution of the course "College English level 1" of 5 clusters

Figure 13. Distribution of the course "Compiler Principle" of 5 clusters

According to the students' actual situation, we choose the result with 4 clusters. The students that cluster2 stands for in this result are the data area we choose for students' achievement early warning.

V. CONCLUSION

This paper is mainly about the use of a clustering algorithm in data mining to pretreat students' achievement data. At first, we introduce the concept of the k-means algorithm, and then we propose an optimized algorithm to make up for its deficiencies and integrate it into the open-source WEKA platform. Then we clean and convert the data sources and perform cluster analysis using the improved algorithm through the WEKA platform. In our experiment, we use two settings for the number of clusters, one being 4 and the other 5. We analyze the results with 4 clusters in detail and divide the students into four clusters: cluster0, cluster1, cluster2 and cluster3. Cluster2 contains the students with lower scores that we pay close attention to, students of cluster0 and cluster1 have good achievement, and cluster3 is in the middle. At last we find that the result with four clusters is more accurate when comparing the examples of the clusters with reality. The experiment shows that the optimized algorithm makes up for the deficiencies of the original algorithm and improves its stability as well as ensuring its accuracy.

ACKNOWLEDGMENT

This research was supported by the China National Key Technology R&D Program (2009BADA9B00), the National Nature Science Foundation (61070050), the Beijing Educational Committee science and technology development plan project (KM20111002818), and the Open Project Program of the Key Laboratory of Digital Agricultural Early-warning Technology, Ministry of Agriculture, Beijing, 100037.

REFERENCES

[1] Zhang Yuntao, Gong Ling. Principles and Techniques of Data Mining [M]. Electronics Industry Press, 2004, 43-40.
[2] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques [M]. Mechanical Industry Press, 2007, 31-23.
[3] Hall P. Beck, William D. Davidson. Establishing an Early Warning System: Predicting Low Grades in College Students from Survey of Academic Orientations Scores [J]. Research in Higher Education, 2001, 42(6): 709-723.
[4] Deirdre Billings. Early Warning Systems: Improving Student Retention and Success [C]. Proceedings of the 15th Annual NACCQ, Hamilton, New Zealand, July 2002.
[5] Yang Xia-ling, Nie Yong-hong. Application of the clustering analysis in the forecast of employment of graduates [J]. Journal of Guang Xi University of Technology, 2005, 16(4): 82-84.
[6] Margaret H. Dunham. Data Mining Tutorial [M]. Tsinghua University Press, 2006, 3: 107-122.
[7] Chen Jian. Supporting the Knowledge Management Model of the Distributed Information System - Applied Network Technology and the Decision-making Support System of Data Warehouse [N]. China Information Review, 2001(4): 59-60.


Lijuan Zhou received the B.Tech. degree in Computer Application Technology from Heilongjiang University in 1991, the MS degree in Computer Application Technology from the Harbin University of Science and Technology in 1998, and the PhD degree in Computer Application Technology from the Harbin Engineering University in 2004.
She is a professor of database systems and data mining at the Capital Normal University. She has conducted research in the areas of database systems, data mining, data warehousing, Web mining, object-oriented database systems, and artificial intelligence, with more than 30 journal or conference publications. Her primary research interests are in OLAP, data mining, and data warehouses.

Yuyan Chen received the Bachelor of Engineering degree in Electronic Information Engineering from the Capital Normal University in 2011 and is currently studying for a master's degree at the Capital Normal University. Her mentor is Professor Lijuan Zhou and her main research fields are data warehouses and data mining.

Shuang Li received the Bachelor of Engineering degree in Computer Science and Technology from the Capital Normal University in 2007 and the MS degree from the Capital Normal University in 2011. Her mentor is Professor Lijuan Zhou and her main research field is data mining. She has published three papers in international conferences.


Dynamic Analysis for Geographical Profiling of Serial Cases Based on Bayesian-Time Series

Guanli Huang
Beijing Vocational College of Electronic Science, Beijing, China
huangguanli@dky.bjedu.cn

Guanhua Huang
School of Social Development and Public Policy, Beijing Normal University, Beijing, China
cherryhenry@163.com

Abstract—The analysis of spatial information has long been considered valuable for police agencies within the criminal investigative process. This is especially true for serial crime cases, where criminologists and psychologists apply geographical profiling to model criminal mobility distribution and behavior patterns in order to estimate a criminal's likely residence. In recent years the availability of advanced computational mathematic tools has enabled us to establish mathematical models to replace the traditional empirical method. However, as a new technology, current geographical profiling models are still fundamental and impractical. In this article, based on existing frameworks, we establish three new methodologies, namely the Bayesian-Factor analysis model, the Time series analysis model and the GIS (Geographic Information System)-Decay model, to study geographical profiling problems. Then we test and compare their accuracy, efficiency, sensitivity and robustness on 11 historical serial crime samples and Monte Carlo simulations. Finally, we discuss the advantages and disadvantages of each model and provide executive guidelines on how to apply these models jointly to real cases.

Index Terms—geographical profiling, time series analysis, bayesian analysis, geographic information system

(This work is supported by the 2010 Scientific Fund of Beijing Education Commission, Item No. KM201000002002, China.)

I. INTRODUCTION

Can we tell where an offender lives from where he or she commits crimes? Can we predict where he or she will commit the next crime? An excellent police officer may answer, "Yes, but only sometimes." Indeed, some police officers can make correct judgments for some cases according to their experience and intuition. But hardly any of them can tell which one is their "lucky" case or why their decisions are correct. That is why we need to develop a scientific and systematic methodology to replace our traditional empirical operations. Due to the development of computational mathematics, there are many recently developed frameworks for this issue. The most famous of them is called geographical profiling.

A. The past theories
Geographic profiling is a criminal investigative methodology that analyzes the locations of a connected crime series to make a prediction about where the offender is most likely to live. It is primarily used as a suspect prioritization and information management tool [1]. The main process of geographic profiling is using crime scene information (e.g. the locations of crimes) to infer personal characteristics of the responsible criminal (e.g. the offender's residence). Recently, there has been academic debate over the appropriate method for geographical profiling, with some researchers arguing for simple geometric and statistical methods [2][3], whereas others argue for complex computer-based algorithms [4].

The theory and conceptual framework of geographical profiling is built upon environmental criminology [5]. Environmental criminology involves a number of theoretical concepts, such as Routine Activity, Crime Pattern and Rational Choice.

The Routine Activity theory [6] suggests that crime depends on the intersection of three elements (offender, target, and environment). In detail, when an individual


who wants to offend in an environment that is appropriate for criminal activity (e.g. in the absence of surveillance) encounters a suitable target, crime will occur. Geographical profiling is based on the premise that crime will happen only in places visited during the criminal's routine activities. Therefore the area around an offender's home is the most likely place to commit crime.

The second theory, Crime Pattern Theory [5], holds that crime will occur in places where an offender's awareness space intersects with perceived suitable targets. So, just like the routine activity theory, criminal routine activities are also important in determining the locations of crime in this theory. Thus, the two theories provide the theoretical basis for geographical profiling.

According to Rational Choice Theory [7], the decision to commit crime is the outcome of a decision-making process which weighs the expected costs, efforts, and rewards. If the costs and efforts are low enough and the rewards are high enough, crime will occur. The key element of this theory for geographical profiling is that locations and targets that are easily accessible to the offender will be perceived as involving few costs, little effort and greater rewards, so it is the locations around the home that are predicted to be the areas in which criminals are likely to offend. But the potential costs of crime around the home location can be high (because of the risk of being recognized), and traveling long distances may gain great rewards. In sum, the area around the home location is by no means always the location a criminal will choose to offend in.

Despite the argument about the geographical profiling method, the basic assumptions underlying the process of geographical profiling are simple and undisputed. There are four major theoretical and methodological assumptions required for geographic profiling [1]: The case involves a series of at least five crimes committed by the same offender. The series should be relatively complete, and any missing crimes should not be spatially biased (such as might occur with a non-reporting police jurisdiction). The offender has a single stable anchor point over the period of the crimes. The offender is using an appropriate hunting method. The target backcloth is reasonably uniform.

However, some of those assumptions are not necessary in our model, which combines different schemes suitable for various situations (e.g. a series of four crimes, an offender with two bases, a criminal who changes his hunting strategy during the series). In other words, the reduced level of accuracy will be controlled in the model when some information is missing.

B. Our Dynamic Analysis models
We established three different geographical profiling models: the GIS-Decay model, the Bayesian-Factor analysis model, and the Time series analysis model. These three models are respectively accomplished in dealing with short (fewer than 6 offences), medium (6-8), and long (more than 8) serial crimes.

The first model is the GIS-Decay model. A Geographic Information System is professional software that captures, measures and analyzes geographic features. With the help of GIS, we develop a travel metric function which includes travel distance, travel cost and time cost, instead of the Euclidean (straight line) or Manhattan (travel distance) metrics that are widely used in former articles. This model has an obvious advantage for cases with a small data set, but a relatively lower efficiency than the other two models, so it can give useful information to the police agency even at the beginning of a serial crime.

The second model is the Bayesian-Factor analysis model. This method is based on Bayesian statistics: it treats each crime location as an independent random variable obeying some certain distribution and then calculates the distribution of the anchor point. The distribution of each crime location is decided by some factors that may influence the criminal's spatial decisions. This model has moderate accuracy and efficiency but strong robustness. It is suitable for cases with a medium data set and cases where we have some information about the criminal's characteristics.

The third model is the Time series analysis model. In this model, we treat crime locations as a correlated ordered sequence rather than independent points. By deducing an autoregressive model and solving Yule-Walker equations, we can not only find the anchor point but also predict the next latent crime site. The performance of this model improves significantly as the data set gets large, and it can also deal with some complicated situations like multiple anchor points and a nomadic criminal.

The main purpose of our models is to aid police agencies' investigations of serial criminals with helpful predictions. The input to our models is the locations of the crimes from a series, and the output is a set of possible areas with different probabilities for the location of the offender's home and the next crime. The local geographical information and the sequence of crimes are both used in our models. We use Geographic Information System (GIS) software to measure and calculate detailed geographical features of the map, which guarantees our models a better accuracy than other models that work simply on a plane map. Also, most factors that may significantly influence spatial decision-making are considered in our models. Furthermore, our models can work in some complicated situations such as a criminal with two bases or a nomadic criminal.

II. GIS-DECAY MODEL AND BAYESIAN-FACTOR ANALYSIS MODEL

A. Geographical Information System
When we solve geographical profiling problems, time is more than just money. We cannot wait for the data set to get large enough for our model; that would mean more victims suffering from the criminal. We need a model that can work well with a small data set. So this means


sometimes a simple model works better than an overly complicated one. Suppose we are helping the police agency investigate a newly started serial crime. All the information we have so far is just a few crime locations (usually fewer than five) and some hypotheses about the criminal's identity (gender, age, travel method, etc.). With this insufficient information it is hard to give a precise forecast of the location of the anchor point and the next crime site. However, we can still make full use of these resources and give some useful guidelines to help the police agency carry on searching.

A Geographical Information System is a cluster of tools used for the collection, storage, management, processing, analysis, display and description of geographical data. They are highly efficient with the appropriate support of computer science. Subjects of GIS include spatial data, raster data, remote sensing data, etc. The development of GIS has greatly improved the methodology of the geographic sciences, as well as provided chances of solving sophisticated problems in planning, decision making and managing. GIS can not only be applied to study and research, but to more daily issues as well, such as the case of criminology. In fact, GIS has proved its value on a much larger scale recently; some even use GIS during the design of a 500 m2 plaza and receive considerably convincing results. We test the efficiency and reliability of different schemes with the help of GIS. The actual application includes the following operations: vectorization of spatial data; display of geographic information; comparison between historical samples and algorithm results.

B. Distance metric
One basic concept for the geographical profiling problem is the average distance ($\mu$) a criminal is willing to travel to commit a crime. However, how to measure this distance depends on the choice of distance metric. Typically, there are many widely used choices for this metric, including the Euclidean metric and the Manhattan metric. Both of these metrics treat the hunting area as a uniform and isotropic 2-dimensional plane, which means they ignore the detailed geographical features of the region. As we know, driving along a road is much faster than traveling through farmland, and taking a ferry or train may cost extra travel expenses. According to psychological theory, most criminals will weigh the travel cost against their desire for the target, so a region with a high travel cost is less likely to suffer from this criminal. With GIS, we can set some travel cost functions based on the geographical features. For example, in our models we set different travel speeds for different landforms; the specific settings are shown in Table II.

TABLE II. AVERAGE TRAVEL SPEED IN DIFFERENT LANDFORMS
Landform        Average travel speed (km per hour)
Highway         100
Road            70
City area       40
Farmland        20
Mountain area   15
Forest area     10

We then get a travel cost metric by separating the path between two points into small intervals, calculating the time cost for each interval and integrating them to get a total time cost. Then, based on the average available time of a criminal, we can get the distribution of the average distance on the map.

Decay functions: Once the distance $\mu$ has been decided, we can describe the criminal's spatial decision-making strategy by a decay function. The idea of this function is that criminals are most likely to commit crimes in the area around $\mu$; the possibility decreases as the travel distance both increases and decreases from there. Several typical choices for the decay function are shown in Table I.

TABLE I. CHOICES FOR THE DECAY FUNCTION
Linear:                          $f(r) = A + Br$
Negative exponential:            $f(r) = A e^{-Br}$
Normal:                          $f(r) = A (2\pi S^2)^{-1/2} \exp[-(r - \bar{r})^2 / 2S^2]$
Lognormal:                       $f(r) = A (2\pi r^2 S^2)^{-1/2} \exp[-(\ln r - \ln \bar{r})^2 / 2S^2]$
Truncated negative exponential:  $f(r) = Br$ if $r < C$, and $f(r) = A e^{-r}$ if $r \ge C$

Score functions: After making a choice of distance metric $d$ and decay function $f$, we can construct a score function $S(y)$ by summing $f$ over all available points $x_1, \dots, x_n$:

$$S(y) = \sum_{i=1}^{n} f(r(x_i, y)) = f(r(x_1, y)) + \cdots + f(r(x_n, y)). \qquad (1)$$

Areas with a high score are considered more likely to contain the offender's anchor point, and vice versa.

C. Bayesian analysis
Recently, a new method using Bayesian analysis has been developed to solve the geographic profiling problem. First, let us consider the simplest (and perhaps also the worst) situation for a serial crime. Suppose the police agency has no information (gender, age, behavior habits, etc.) about the offender. All they know is that this criminal has committed a serial crime, and the location of each crime scene is available. Thus, we assume that this criminal commits each offence randomly according to some particular probability density $P(x)$. By basic statistics, the probability that the criminal commits a crime in a given area $\Omega$ can be calculated as $\int_{\Omega} P(x)\,d\Omega$.

According to our basic assumptions, although there is no information about the criminal, we do know there are at least two factors that may influence his/her choice of target location. The first is the criminal's anchor point $z$; the other is the average distance this criminal is willing to travel to commit a crime, denoted by $\mu$. For each given pair of $z$ and $\mu$, the conditional probability density of $x$ is $P(x \mid z, \mu)$. By Bayes' theorem, the conditional probability density of $z$ and $\mu$ for a given $x$ can be presented as

$$P(z, \mu \mid x) = \frac{P(x \mid z, \mu)\,\pi(z, \mu)}{P(x)}. \qquad (2)$$

The term $P(x)$ is the marginal distribution, which is independent of $z$ and $\mu$. Since we do not care about the absolute value of the probability density, we can ignore the $P(x)$ term and rewrite (2) as

$$P(z, \mu \mid x) \propto P(x \mid z, \mu)\,\pi(z, \mu). \qquad (3)$$

Here $\pi(z, \mu)$ is the factor function which characterizes the criminal's spatial decision-making strategy.
So, for the locations $x_1, \dots, x_n$ of a serial crime, the conditional probability density of $z$ can be denoted as $P(z, \mu \mid x_1, \dots, x_n)$, and we can get the multivariate form of (3):

$$P(z, \mu \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid z, \mu)\,\pi(z, \mu). \qquad (4)$$

For mathematical simplicity, an assumption that all of the offence sites are independent is widely used. Then we get the decomposition

$$P(x_1, \dots, x_n \mid z, \mu) = P(x_1 \mid z, \mu) \cdots P(x_n \mid z, \mu). \qquad (5)$$

So

$$P(z, \mu \mid x_1, \dots, x_n) \propto P(x_1 \mid z, \mu) \cdots P(x_n \mid z, \mu)\,\pi(z, \mu). \qquad (6)$$

Many articles still assume independence between $z$ and $\mu$. But in the following sections, we will illustrate that this assumption is not only unnecessary but also incorrect in many cases.
At last, since we only care about the location of the anchor point $z$, we can integrate (6) and get the conditional probability density of $z$ independent of $\mu$:

$$P(z \mid x_1, \dots, x_n) \propto \int P(x_1 \mid z, \mu) \cdots P(x_n \mid z, \mu)\,\pi(z, \mu)\,d\mu. \qquad (7)$$

The expression $P(z \mid x_1, \dots, x_n)$ gives the anchor point probability density of the criminal who has already committed crimes at the locations $x_1, \dots, x_n$. This probability density naturally provides us with a rigorous search area with a high probability of finding the anchor point of the offender.

D. Factor functions
Most research on geographic profiling only considers the main factor of the distance between the criminal's anchor point and the crime location, and may ignore the factors that potentially influence spatial decision making. Lundrigan et al. [8] pointed out that all criminal spatial decisions are mediated by social, economic, and cognitive factors. The factors that may influence the location of the criminal's home include: the development of the series; the age, intellectual capability, employment status, marital status, and motive; the mode of transportation that they use; and the type of the crime. Snook [9] gave some brief explanations of the factors that influence serial murderers' spatial decisions. So there are several factors that may influence a criminal's spatial decision making. In this model, all the factors we take into consideration are shown in Table III.

TABLE III. CONSIDERED FACTORS
Factor                    Denotation
Time coverage             TC
Mode of transportation    MT
Intellectual capability   IC
Age                       AG
Gender                    GE

A simple model of the criminal's spatial decision-making strategy can be described as a function ($\pi$) of his/her anchor point location ($z$) and the average distance ($\mu$) he/she is willing to travel to commit a crime, i.e. $\pi(z, \mu)$. This function should also be influenced by all the factors in the table. Thus we can get (8):

$$\pi(z, \mu) = g(TC, MT, IC, AG, GE). \qquad (8)$$

The form of this formula can vary since it is merely an empirical function. In this article, we just consider the simplest situation, in which all factors are independent of each other and obey normal distributions:

$$\pi(z, \mu) = A(2\pi)^{-1/2} \exp\Big\{-\Big[\Big(1 - \tfrac{TC}{\overline{TC}}\Big)^2 + \Big(1 - \tfrac{MT}{\overline{MT}}\Big)^2 + \Big(1 - \tfrac{IC}{\overline{IC}}\Big)^2 + \Big(1 - \tfrac{AG}{\overline{AG}}\Big)^2 + \Big(1 - \tfrac{GE}{\overline{GE}}\Big)^2\Big]\Big\}. \qquad (9)$$

All the factors have already been normalized before being taken into (9).
Another approach to the factor function assumes that all these factors only influence the average criminal distance $\mu$, i.e.

$$\mu = g(TC, MT, IC, AG, GE). \qquad (10)$$

Since we usually assume a negative exponential decay function, a possible form of $\mu$ can be represented as

$$\mu = A \exp\Big(-Br\Big(\tfrac{TC}{\overline{TC}} \cdot \tfrac{MT}{\overline{MT}} \cdot \tfrac{IC}{\overline{IC}} \cdot \tfrac{AG}{\overline{AG}} \cdot \tfrac{GE}{\overline{GE}}\Big)\Big). \qquad (11)$$

The above discussion just showed two basic examples of the factor function. The form of the factor function for a specific case still greatly depends on the experience of the local police agency and former samples from that region.
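As a rough illustration of how the factors of Table III could adjust the travel distance in the spirit of (10)-(11), the following hedged sketch applies a simple multiplicative correction of normalized factor values to a baseline distance. This is only one possible instantiation; the baseline value, averages and factor values below are invented placeholders, not values from the paper.

// Hedged sketch of a factor adjustment: each normalized factor (value divided
// by its population average) pulls the baseline travel distance mu up or down.
public class FactorAdjustmentSketch {
    static double adjustedMu(double baseMu, double[] factors, double[] averages) {
        double product = 1.0;
        for (int i = 0; i < factors.length; i++)
            product *= factors[i] / averages[i];    // normalized factor, as assumed in the text
        return baseMu * product;                    // multiplicative variant of (10)-(11)
    }

    public static void main(String[] args) {
        // order: TC, MT, IC, AG, GE (Table III); all values are hypothetical
        double[] offender = {1.2, 1.5, 1.0, 0.9, 1.0};
        double[] averages = {1.0, 1.0, 1.0, 1.0, 1.0};
        System.out.println("adjusted mu = " + adjustedMu(3.0, offender, averages) + " km");
    }
}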


III. TIME SERIES ANALYSIS MODEL

A. Introduction of time series analysis
In geographic profiling study, the available criminal locations can be treated as a sequence of data points, which can also be called a time series in statistics. Time series analysis is the subject that aims to find methods for analyzing time series data in order to discover meaningful statistics and relationships in the data. The Bayesian analysis method we discussed in Section II only regards the location information and ignores the ordering of the crimes; however, the temporal ordering information of a serial crime is often as significant as the location information. The Bayesian analysis method also requires the assumption that all locations are independent of each other in order to keep the model mathematically simple. However, as we know, one important property of a serial crime is that there exist relationships among the individual crimes, so the locations may also be related, and this relationship often contains meaningful information for depicting the criminal's behavior. For example, when a criminal commits a serial crime, he/she may gain experience from the previous crimes and adjust his/her strategy in choosing the location of the next one. Indeed, Bayesian analysis, as a widely used static multivariate statistics tool, is inadequate in many time-varying applications. On the contrary, time series analysis is naturally a tool for temporally ordered and correlated data. Therefore, we have good reason to apply time series analysis to solving geographical profiling problems.

B. Autoregressive models
First, we label the available crime locations by their sequence as $x_1, \dots, x_n$, where $1, \dots, n$ is the natural temporal order of the sequence, and denote the latent new crime location as $x_{n+1}$. Considering the most general situation, each $x_n$ is decided by some function of the $t$ points before it and some error term, where $0 < t < n$. So we can present $x_n$ as $x_n = f(x_{n-1}, x_{n-2}, \dots, x_{n-t}, \varepsilon_n)$, where $\varepsilon_n$ is a white noise error term that usually obeys a zero-mean normal distribution, i.e. $\varepsilon_n \sim WN(0, \sigma^2)$.

There are many classes of time series models with different stochastic representations. One of practical importance is the autoregressive (AR) model. A typical autoregressive model of order $t$ can be defined as

$$X_n = \phi_1 X_{n-1} + \phi_2 X_{n-2} + \cdots + \phi_t X_{n-t} + \varepsilon_n, \quad \varepsilon_n \sim WN(0, \sigma^2). \qquad (12)$$

By replacing $X_n$ with $X_{n-1}, \dots, X_{n-t+1}$, we can get a set of formulas

$$X_n = \phi_1 X_{n-1} + \cdots + \phi_t X_{n-t} + \varepsilon_n, \quad X_{n-1} = \phi_1 X_{n-2} + \cdots + \phi_t X_{n-t-1} + \varepsilon_{n-1}, \;\dots,\; X_{n-t+1} = \phi_1 X_{n-t} + \cdots + \phi_t X_{n-2t+1} + \varepsilon_{n-t+1}. \qquad (13)$$

By solving these equations, we can fix all the parameters $\phi_1, \dots, \phi_t$. Then we can predict $X_{n+1}$ based on this model:

$$X_{n+1} = \phi_1 X_n + \phi_2 X_{n-1} + \cdots + \phi_t X_{n-t+1} + \varepsilon_{n+1}. \qquad (14)$$

For a real $n$-point geographical profiling problem, we can carefully choose the model order $t$ to guarantee that $2t + 1 < n$ in order to make our model resolvable. The model will be a little more complicated if we take the anchor point $z$ into consideration. The number of anchor points and the criminal's behavior will both affect the model form. In this subsection, we start our discussion with the simplest situation. We assume that there is only one anchor point, and the criminal must return to this anchor between any two consecutive crimes. By this basic assumption, we get the new time series containing the anchor point $z$ as $z, x_1, z, x_2, \dots, z, x_n$. Thus, for any given $n$ and $t$, we can update our model as

$$X_n = \phi_1 X_{n-1} + \beta_1 Z + \phi_2 X_{n-2} + \beta_2 Z + \cdots + \phi_t X_{n-t} + \beta_t Z + \varepsilon_n = \phi_1 X_{n-1} + \phi_2 X_{n-2} + \cdots + \phi_t X_{n-t} + (\beta_1 + \beta_2 + \cdots + \beta_t) Z + \varepsilon_n. \qquad (15)$$

Since $\beta_1, \beta_2, \dots, \beta_t$ are all parameters, we can combine them into a single parameter $\alpha$ and rewrite (15) as

$$X_n = \alpha Z + \phi_1 X_{n-1} + \phi_2 X_{n-2} + \cdots + \phi_t X_{n-t} + \varepsilon_n. \qquad (16)$$

Then, by choosing a proper order $t$ ($2t < n$) and replacing $X_n$ with $X_{n-1}, \dots, X_{n-t}$, we can get a new set of formulas:

$$X_n = \alpha Z + \phi_1 X_{n-1} + \cdots + \phi_t X_{n-t} + \varepsilon_n, \quad X_{n-1} = \alpha Z + \phi_1 X_{n-2} + \cdots + \phi_t X_{n-t-1} + \varepsilon_{n-1}, \;\dots,\; X_{n-t} = \alpha Z + \phi_1 X_{n-t-1} + \cdots + \phi_t X_{n-2t} + \varepsilon_{n-t}. \qquad (17)$$

After solving these equations, we can fix $\alpha, \phi_1, \dots, \phi_t$ and get the expression for the anchor point $Z$:

$$Z = (X_n - \phi_1 X_{n-1} - \cdots - \phi_t X_{n-t} - \varepsilon_n) / \alpha. \qquad (18)$$

C. Nomadic criminal and multi anchor point situations
Sometimes a criminal may commit crimes continuously without going home, or may commit crimes while traveling. However, it is always hard to judge when and how often a criminal will return to his/her anchor point during the series, or whether this criminal is traveling into the area to commit crimes; so far there is no published evidence that police officers are able to answer these questions with any accuracy. But if we need this kind of assumption to help our investigation, we can simply replace the time series with one that matches our evidence and use the method above to solve it. For example, $A, X_1, A, X_2, \dots, X_n$ describes the situation in which the criminal returned to his/her anchor point only after the first crime and then traveled into another area to commit continuous crimes.

TABLE IV. DESCRIPTION OF CASES
Case No.   Number of crimes   Criminal nationality   Type of crime
C01        4                  China                  Murder
C02        5                  China                  Murder, Rape
C03        5                  China                  Murder, Rape
C04        5                  China                  Murder
C05        5                  U.S.A.                 Murder
C06        6                  China                  Murder
C07        6                  U.S.A.                 Murder
C08        6                  China                  Murder
C09        7                  China                  Robbery
C10        10                 British                Murder
C11        12                 China                  Murder, Rape

As the proverb goes, "The mouse that has but one hole is quickly taken." Though the single anchor point assumption perfectly fits many cases, some extremely crafty criminals may have multiple anchor points to hide themselves. So for some long-series criminals it is necessary to suspect that they have more than one hideout. That is why and when we need to take a multi-anchor model into account. In the following discussion, we only deduce the model for two anchor points; the idea for models with more than two anchor points is just the same, but more complicated in representation.
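Before turning to the two-anchor case, the following hedged sketch illustrates the single-anchor idea of (16)-(18) in its simplest first-order form. It fits $X_n = c + \phi X_{n-1} + \varepsilon$ to each coordinate by ordinary least squares and reads the anchor off as the stationary mean $c/(1-\phi)$; treating the long-run mean as the anchor point (i.e. taking $\alpha = 1 - \phi$) is an assumption made purely for this illustration, not the paper's exact estimator, and the coordinates are invented sample data.

// Hedged sketch: first-order AR fit per coordinate, anchor read off as the
// stationary mean, plus a one-step prediction of the next crime site as in (14).
public class ArAnchorSketch {
    /** Returns {c, phi} from an OLS fit of x[i] = c + phi * x[i-1]. */
    static double[] fitAr1(double[] x) {
        int n = x.length - 1;
        double mx = 0, my = 0;
        for (int i = 1; i <= n; i++) { mx += x[i - 1]; my += x[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 1; i <= n; i++) {
            sxy += (x[i - 1] - mx) * (x[i] - my);
            sxx += (x[i - 1] - mx) * (x[i - 1] - mx);
        }
        double phi = sxy / sxx;
        double c = my - phi * mx;
        return new double[] {c, phi};
    }

    public static void main(String[] args) {
        double[] east =  {2.1, 3.4, 1.8, 2.9, 2.4, 3.1};   // crime x-coordinates (km), invented
        double[] north = {1.2, 0.8, 1.9, 1.1, 1.6, 0.9};   // crime y-coordinates (km), invented
        double[] fe = fitAr1(east), fn = fitAr1(north);
        double anchorX = fe[0] / (1 - fe[1]);               // stationary mean per coordinate
        double anchorY = fn[0] / (1 - fn[1]);
        double nextX = fe[0] + fe[1] * east[east.length - 1];   // one-step prediction
        double nextY = fn[0] + fn[1] * north[north.length - 1];
        System.out.printf("estimated anchor: (%.2f, %.2f), predicted next site: (%.2f, %.2f)%n",
                anchorX, anchorY, nextX, nextY);
    }
}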


Assume a criminal who committed a serial crime $x_1, \dots, x_n$ has two anchor points $A$ and $B$, and that the average distance this criminal is willing to travel to commit a crime is $\mu$. By the basic decay model, and considering stochastic factors, each crime location can be described as

$$X_n = \lambda_n A + \gamma_n B + R \exp(\varepsilon_n + i\theta_n), \qquad (19)$$

where $\varepsilon_t$ and $\theta_t$ are white noises. The complex function $R \exp(\varepsilon_t + i\theta_t)$ is the stochastic decay term, which represents a random vector with a negative exponentially distributed radius and a uniformly distributed polar angle, and $R$ is the mean of its radius.
As we know, the crime locations $x_1, \dots, x_n$ themselves form a time series which can be depicted by an AR($t$) model

$$X_n = \phi_1 X_{n-1} + \phi_2 X_{n-2} + \cdots + \phi_t X_{n-t} + \delta_n. \qquad (20)$$

Combining these two functions, we get

$$\lambda_n A + \gamma_n B + R \exp(\varepsilon_n + i\theta_n) = \phi_1\big(\lambda_{n-1} A + \gamma_{n-1} B + R \exp(\varepsilon_{n-1} + i\theta_{n-1})\big) + \cdots + \phi_t\big(\lambda_{n-t} A + \gamma_{n-t} B + R \exp(\varepsilon_{n-t} + i\theta_{n-t})\big) + \delta_n. \qquad (21)$$

Since $\phi_1, \dots, \phi_t$ are parameters, we can absorb them into coefficients $\psi_0, \psi_1, \dots, \psi_t$ and rewrite the above function as

$$A \sum_{i=0}^{t} \psi_i \lambda_{n-i} + B \sum_{i=0}^{t} \psi_i \gamma_{n-i} = \xi_t, \qquad (22)$$

where

$$\xi_t = \delta_n + R \sum_{i=1}^{t} \phi_i \exp(\varepsilon_{n-i} + i\theta_{n-i}) - R \exp(\varepsilon_n + i\theta_n). \qquad (23)$$

Since $A$ and $B$ are two different anchor points, we can treat $\lambda_n$ and $\gamma_n$ as two independent time series. Furthermore, we need $\lambda_n$ and $\gamma_n$ to be weakly stationary time series for mathematical stringency. The mean and covariance of a weakly stationary time series $X_t$ have the following properties:
Expectation $E(X_n) = \mu < \infty$, independent of $n$;
Variance $\mathrm{Var}(X_n) = \sigma^2 < \infty$, independent of $n$;
Covariance $\mathrm{Cov}(X_n, X_{n-t}) = \gamma_t$, a function of $t$ only.
The idea of the weak stationarity condition is that the behavior of the underlying process does not change with time. As a basic assumption of time series analysis, this condition can easily be satisfied by most non-dynamic time series. We also notice that $\varepsilon_n$, $\theta_n$ and $\delta_n$ are all white noises independent of $\lambda_n$ and $\gamma_n$. Thus we get

$$E(\xi_t) = E(\delta_n) + R \sum_{i=1}^{t} \phi_i E\big(\exp(\varepsilon_{n-i} + i\theta_{n-i})\big) - R\,E\big(\exp(\varepsilon_n + i\theta_n)\big) = R\Big(\sum_{i=1}^{t} \phi_i - 1\Big), \qquad (24)$$

and $\mathrm{Cov}(\xi_t, \lambda_n) = \mathrm{Cov}(\xi_t, \gamma_n) = 0$.
Taking $E(\xi_t)$ and $\mathrm{Cov}(\xi_t, \lambda_n), \dots, \mathrm{Cov}(\xi_t, \lambda_{n-t})$ on the right side of function (22), we can get

$$(\bar{\lambda} A + \bar{\gamma} B) \sum_{i=0}^{t} \psi_i = R\Big(\sum_{i=1}^{t} \phi_i - 1\Big), \quad A\,(\psi_0 \rho_1 + \psi_1 \rho_0 + \psi_2 \rho_1 + \cdots + \psi_t \rho_{t-1}) = 0, \quad A\,(\psi_0 \rho_2 + \psi_1 \rho_1 + \psi_2 \rho_0 + \cdots + \psi_t \rho_{t-2}) = 0, \;\dots,\; A\,(\psi_0 \rho_t + \psi_1 \rho_{t-1} + \psi_2 \rho_{t-2} + \cdots + \psi_t \rho_0) = 0, \qquad (25)$$

where $\bar{\lambda}$ and $\bar{\gamma}$ are the means of the two series and $\rho_k$ denotes the lag-$k$ autocovariance. This set of linear equations is usually called the Yule-Walker equations. Similarly, we can also write the Yule-Walker equations for $B$. By combining these two sets of Yule-Walker equations, we get a set of linear equations containing $2t + 2$ variables and $2t + 2$ independent equations. So far, we have obtained the unique solution of this question.

IV. MODEL EVALUATION AND ANALYSIS

In this section, we apply several tests to evaluate the accuracy, effectiveness and robustness of our models. Firstly, we define two measures, Hit Score Percentage and Profile Accuracy, to present the effectiveness and accuracy of geographical profiling models. Then we calculate these two indexes for our models with both historical samples and Monte Carlo simulation cases. Finally, we test the robustness of our models by the mislabeled point method. Following each test, we analyze and compare the advantages and disadvantages of the three models.

A. Historical sample tests

TABLE V. THE COMPARISON BETWEEN MODELS WITH THE SHORT GROUP OF CASES
                      GIS-Decay model   Bayesian-Factor analysis model   Time series analysis model
Hit                   4                 5                                1
Close                 1                 0                                4
Further               0                 0                                0
Average Search Cost   19.5%             14.5%                            27.2%

TABLE VI. THE COMPARISON BETWEEN MODELS WITH THE LONG GROUP OF CASES
                      GIS-Decay model   Bayesian-Factor analysis model   Time series analysis model
Hit                   1                 1                                2
Close                 1                 1                                0
Further               0                 0                                0
Average Search Cost   23%               28%                              6%

By online data collection, we obtained 11 true historical serial crime samples, including 8 Chinese cases, 2 American cases, and one British case; the British sample is the well-known Peter Sutcliffe case. All these cases are categorized into three classes by the number of crimes: less than 6, between 6

and 8, and above 8 (see Table IV). In this article, we choose Hit Score Percentage and Profile Accuracy as the measures to evaluate the effectiveness and accuracy of the models.

The Hit Score Percentage is a measure of the search efficiency of a model. It is defined as the ratio of the area searched (following the geographic profiling prioritization) before the offender's base is found to the total hunting area; the smaller this ratio is, the better the model performs. The hunting area is defined as the rectangular zone oriented along the street grid containing all crime locations [4]. Profile Accuracy is a measure of whether the offender's base lies within the top profile area. In this paper, we use a simple classification to indicate whether the estimations are hit, close, or off-target.
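As an illustration of how the Hit Score Percentage can be computed from a model's prioritization surface, the Python sketch below ranks grid cells by score and reports the fraction of the hunting area searched before the cell containing the offender's base is reached. The grid, the cell indexing and the function name are our own assumptions, not the authors' implementation.

```python
import numpy as np

def hit_score_percentage(score_grid, base_cell):
    """Fraction of the hunting area searched, in descending score order,
    before the cell containing the offender's base is reached."""
    order = np.argsort(score_grid, axis=None)[::-1]           # highest-priority cells first
    base_index = np.ravel_multi_index(base_cell, score_grid.shape)
    rank = int(np.where(order == base_index)[0][0]) + 1       # 1-based search position
    return rank / score_grid.size

# toy example: a 10 x 10 grid of model scores, offender base at cell (3, 7)
rng = np.random.default_rng(1)
print(hit_score_percentage(rng.random((10, 10)), (3, 7)))
```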
Figure 1. C02, C08 and C10 case results using the three models.

The tests of the Bayesian-Factor analysis and Time series models are programmed in MATLAB, while the GIS-Decay model is run in GIS directly. The parameter values of the Bayesian-Factor model are taken from the average values published in some overview articles [9]. Typical results for the three models in each class are shown in Fig. 1. Tables V, VI and VII compare the models when handling the cases in the short, medium and long groups separately. From these tables we can tell that the three models are respectively best suited to short (fewer than 6 offences), medium (6-8), and long (more than 8) serial crimes. The GIS-Decay model has an obvious advantage for cases with a small data set, but a relatively low efficiency compared with the other two models, so it can give useful information to a police agency even at the beginning of a serial crime. The Bayesian-Factor analysis model has moderate accuracy and efficiency but strong robustness; it suits cases with a medium data set and cases where some information about the criminal's characteristics is available. The time series model, although it performs poorly on short serial crimes, gains accuracy greatly as the data set grows, and for long serial crimes it can give a precise result.

B. Monte-Carlo simulation tests

Because the number of true historical serial-crime samples is limited, our models turn to Monte-Carlo simulation for fuller testing. The idea of the simulation is to first fix an anchor point and generate several data points around it as crime locations by a Monte Carlo algorithm, and then run the three models to check how well they can find the anchor point. In this test, we vary the number of generated data points from 3 to 20 and run 1000 repeated simulations for each circumstance. The results of the test are shown in Fig. 2. According to the figure, the result of the simulation test basically matches the result of the historical sample test, which shows that the models are suitable for analyzing common cases and have good stability.
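A minimal sketch of the generation step described above: crime sites are drawn around a fixed anchor point with an exponentially distributed radius and a uniform polar angle, matching the decay term used earlier. The function name, parameter names and the mean-radius value are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def simulate_crime_sites(anchor, n_points, mean_radius, rng):
    """Draw n_points locations around an anchor point with an exponentially
    distributed radius and a uniformly distributed polar angle."""
    r = rng.exponential(mean_radius, n_points)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
    return np.column_stack([anchor[0] + r * np.cos(theta),
                            anchor[1] + r * np.sin(theta)])

rng = np.random.default_rng(2)
sites = simulate_crime_sites(anchor=(0.0, 0.0), n_points=8, mean_radius=1.5, rng=rng)
```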

Figure 2. Robustness of the three models.

C. Robustness test

Sometimes it is inevitable for police officers to mislabel an irrelevant offence as part of the underlying serial crimes. We are concerned with whether our models can still give a relatively correct answer when this happens, so a robustness test for the models is necessary and meaningful.

TABLE VII. THE COMPARISON BETWEEN MODELS WITH THE MEDIUM GROUP OF CASES

                      GIS-Decay model   Bayesian-Factor analysis model   Time series analysis model
Hit                   3                 2                                1
Close                 1                 2                                0
Further               0                 0                                3
Average Search Cost   16.75%            32%                              56.75%

In order to test the robustness of the three models, we use the Monte Carlo method to generate data points around the fixed anchor point and then randomly place a mislabeled data point into the hunting area to test whether our


models can still find the anchor point. For each n (3 ≤ n ≤ 20), we run 1000 repeated simulations. Then, for each model, we compare its new search result with the former one. The comparison results are shown in the following figures. According to these results, the robustness of all models increases as the data set gets larger. We should particularly notice that the time series model has poor robustness when n < 6.
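The mislabeling experiment can be sketched as follows: one spurious point, drawn uniformly over the rectangular hunting area, is appended to the simulated crime sites before the models are re-run. As before, the function and parameter names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def add_mislabeled_point(sites, rng):
    """Append one spurious 'crime' drawn uniformly over the rectangular
    hunting area spanned by the genuine sites."""
    lo, hi = sites.min(axis=0), sites.max(axis=0)
    outlier = rng.uniform(lo, hi)
    return np.vstack([sites, outlier])

rng = np.random.default_rng(3)
sites = rng.normal(size=(8, 2))            # stand-in for simulated crime sites
contaminated = add_mislabeled_point(sites, rng)
```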
V. CONCLUSIONS

Geographical profiling is an investigative methodology that uses the locations of a connected series of crimes to determine the most probable offender residence. In this article, we established three different geographical profiling models: the GIS-Decay model, the Bayesian-Factor analysis model, and the Time series analysis model. These three models are respectively best suited to short (fewer than 6 offences), medium (6-8), and long (more than 8) serial crimes.

Figure 3. Profile Accuracy

However, as a newly developed methodology, our models are still imperfect. The form of some functions in our models still relies on artificial choices. That means a police officer who makes use of these models needs good experience of regional serial crimes and must not bring his or her own bias into the model operation. Inappropriate use of geographical profiling can have serious consequences. That is all the more reason why we must develop our understanding of what introduces biases and errors into geo-behavioral decision support systems and into the cognitive processes of those who use those systems. Treating these errors as operational problems that have to remain in the hands of police officers will keep criminal investigation in the dark ages of intuition and hunch.

ACKNOWLEDGMENT

This work was supported in part by the 2010 Scientific Fund of Beijing Education Commission (ITEM NO. KM201000002002).

REFERENCES

[1] Rossmo, D. K. Geographic profiling. Boca Raton, FL: CRC Press, 2000.
[2] Paulsen, D. J. Human versus machine: A comparison of the accuracy of geographic profiling methods. Journal of Investigative Psychology and Offender Profiling, 3(2), pp. 77-89, 2006.
[3] Snook, B. Individual differences in distances traveled by serial burglars. Journal of Investigative Psychology and Offender Profiling, 1, pp. 53-66. DOI: 10.1002/jip.003, 2004.
[4] Rossmo, D. K. Geographic heuristics or shortcuts to failure? Response to Snook et al. Applied Cognitive Psychology, 19, pp. 651-654. DOI: 10.1002/acp.1144, 2005.
[5] Brantingham, P. J., & Brantingham, P. L. Environmental criminology. Beverly Hills, CA: Sage Publications, 1981.
[6] Felson, M. Linking criminals' choices, routine activities, informal control, and criminal outcomes. In D. Cornish & R. V. Clarke (Eds.), The reasoning criminal: Rational choice perspectives on offending, pp. 119-128, New York: Springer, 1986.
[7] Clarke, R. V., & Felson, M. Routine activity and rational choice. New Brunswick, NJ: Transaction, 1993.
[8] Lundrigan, S., & Canter, D. V. Spatial patterns of serial murder: An analysis of disposal site location choice. Behavioral Sciences and the Law, 19, pp. 595-610, 2001.
[9] Snook, B., & Cullen, R. M. Serial murderers' spatial decisions: Factors that influence crime location choice. Journal of Investigative Psychology and Offender Profiling, 2, pp. 147-164, 2005.

Guanli Huang, female, born in 1975, is an associate professor at Beijing Vocational College of Electronic Science, with a Master's degree in computer science. She is a participant leader of the Beijing Quality Course on Network Development and a committee member of the China Computer Federation. She has published widely, including academic articles in Computer Engineering and Applications, Computer Science, etc.; her research interests include algorithm design and computer education; she has won one national patent and has published over twenty academic articles as first author, two of which are EI indexed. At present, Ms. Huang also undertakes projects such as Research and Development on a Dynamic Dispatching GPS System Adapting to Road Conditions, a Beijing Education Committee scientific project. She is also a participant in the Simulation Platform of Small Hybrid Vehicle Control Based on dSPACE, a Beijing Science and Technology Innovation Platform project, and in Practice of a School-Enterprise Cooperation Mechanism and Platform Based on Diversification, a Beijing university education reform project. Her research directions are data analysis, signal control, information security, and education management.

Guanhua Huang, born in 1978, is a graduate student at Beijing Normal University. His research interests include data management and technology translation.


Research on Dependable Distributed Systems for Smart Grid
Qilin Li
Production and Technology Department, Sichuan Electric Power Science and Research Institute, Chengdu, P.R.China
Email: li_qi_lin@163.com

Mingtian Zhou
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu,
P.R.China
Email: mtzhou@uestc.edu.cn

Abstract: Within the last few years, smart grid has been one of the major trends in the electric power industry and has gained popularity in electric utilities, research institutes and communication companies. As applications for smart grid become more distributed and complex, the probability of faults undoubtedly increases. This fact has motivated the construction of dependable distributed systems for smart grid. However, dependable distributed systems are difficult to build; they present challenging problems to system designers. In this paper, we first examine the question of dependability and identify major challenges during the construction of dependable systems. Next, we present a view on the fault tolerance techniques for dependable distributed systems. As part of this view, we present the distributed tolerance techniques for the construction of dependable distributed applications in smart grid. Subsequently, we propose a systematic solution based on middleware that supports dependable distributed systems for smart grid and study the combination of reflection and dependable middleware. Finally, we draw our conclusions and point out future directions of research.

Index Terms: smart grid, dependability, dependable middleware, fault-tolerance, fault, error, failure, error processing, fault treatment, replication, distributed recovery, partitioning, open implementation, reflection, inspection, adaptation

I. INTRODUCTION

Within the last few years, smart grid has been one of the major trends in the electric power industry and has gained popularity in electric utilities, research institutes and communication companies. The main purpose of smart grid is to meet future power demands and to provide higher supply reliability, excellent power quality and satisfactory services. Although smart grid brings great benefits to the electric power industry, such a new grid introduces new technical challenges to researchers and engineering practitioners.

As applications for smart grid become more distributed and complex, the probability of faults undoubtedly increases. Distributed systems are defined as a set of geographically distributed components that must cooperate correctly to carry out some common work. Each component runs on a computer, and the operation of one component generally depends on the operation of other components that run on different computers [1] [2]. Although the reliability of computer hardware has improved during the last few decades, the probability of component failure still exists. Furthermore, as the number of interdependent components in a distributed system increases, the probability that a distributed service can easily be disrupted if any of the components involved should fail also increases [2]. This fact has motivated the construction of dependable distributed systems for smart grid. Fault tolerance is needed in many different dependable distributed applications for smart grid. However, dependable distributed systems are difficult to build; they present challenging problems to system designers, who face the daunting requirement of having to provide dependability at the application level as well as deal with the complexities of the distributed application itself, such as heterogeneity, scalability, performance, resource sharing, and the like. Few system designers have these skills. As a result, a systematic approach to achieving the desired dependability for distributed applications in smart grid is needed to simplify this difficult task.

Recently, middleware has emerged as an important architectural component in supporting the construction of dependable distributed systems. Dependable middleware can render building blocks to be exploited by applications for enforcing non-functional properties such as scalability, heterogeneity, fault-tolerance, performance, security, and so on [3]. These attractive features have made middleware a powerful tool in the construction of dependable distributed systems for smart grid [3].

This paper makes three contributions to the construction of dependable distributed systems for smart grid. First of all, we examine the question of dependability and identify major challenges during the construction of dependable systems. Subsequently, we attempt to present a view on the fault tolerance techniques for dependable distributed systems. As part of


this view, we present the distributed tolerance techniques designers have to deal explicitly with problems related to
for building dependable distributed applications in smart distribution, such as heterogeneity, scalability, resource
grid. Finally, we propose a systematic solution based on sharing, partial failures, latency, concurrency control, and
the middleware that supports dependable distributed the like. On the other hand, system developers must have
systems for smart grid and study the combination of a deep knowledge of fault tolerance and must write fault-
reflection and dependable middleware. tolerant application software from scratch [2]. As
The remainder of this paper is organized as follows: consequence, they have to face a daunting and error-
Section studies dependability matters for distributed prone task of providing fault tolerance at the application
systems in smart grid and identifies the major challenges level [2].
for the construction of dependable systems. Section Certain aspects of distributed systems make
introduces basic concepts and key approaches related to dependability more difficult to achieve. Distribution
fault-tolerance. In Section , we discusses distributed presents system developers with a number of inherent
fault-tolerant techniques for building dependable systems problems. For instance, partial failures are an inherent
problem in distributed systems. A distributed service can
in smart grid. Section introduces dependable
easily be disrupted if any of the nodes involved should
middleware to address the ever increasing complexity of
fail. As the number of computing nodes and
distributed systems for smart grid in a reusable way.
communication links that constitute the system increases,
Finally, Section draws our conclusions and points out the reliability of components in a distributed system
the future directions of research. rapidly decreases.
Another inherent problem is concurrency control.
System developers must address complex execution
II. DEPENDABILITY MATTERS
consist of a collection of components, distributed over
Distributed systems are intended to form the
various computers connected via a computer network.
backbone of emerging applications for smart grid,
These components run in parallel on heterogeneous
including supervisory control, data acquisition system
operating systems and hardware platforms and are
and distribution management system, and so on. An
therefore prone to race conditions, the failure of
obvious benefit of distributed systems is that they reflect
communication links, node crashes, and deadlocks. Thus,
the global business and social environments in which
dependable distributed systems are often more difficult to
electric utilities operate. Another benefit is that they can
develop, applications developers must cope explicitly
improve the quality of service in terms of scalability,
with the complexities introduced by distribution.
reliability, availability, and performance for complex
In theory, the fault tolerance mechanisms of a
power systems.
dependable distributed system can be achieved with
Dependability is an important quality in power
either software or hardware solution. However, the cost
distributed applications. In general terms, a system's
of custom hardware solution is prohibitive. In the
dependability is defined as the degree to which reliance
meantime, software can provide more flexibility than its
can justifiably be placed on the service it delivers [4]. The
counterpart[2]. As a result, software is a better choice for
service delivered by a system is its behavior as it is
implementing the fault tolerances mechanisms and
perceived by its user(s); a user is another system
policies of dependable distributed systems [2]. However,
(physical, human) which interacts with the former [4].
the software solution for the construction of dependable
More specifically, dependability is a global concept that
is also difficult. This is particularly true if distributed
encapsulates the attributes of reliability (continuity of
systems dependability requirements dynamically change
service), availability (readiness for usage), safety
during the execution of an application. Further
(avoidance of catastrophes), and security (prevention of
complicating matters are accidental problems such as the
unauthorized handling of information) [2] [4]. In power
lack of widely reused higher level application
distributed environments, even small amounts of
frameworks, primitive debugging tools, and non-scalable,
downtime can annoy customers, hurt sales, or endanger
unreliable software infrastructures. In that case, fault
human lives. This fact has made it necessary to build
tolerance can be achieved using middleware [2].
dependable distributed systems for electric utilities.
Middleware can be devised to address these problems
Fault tolerance is an important aspect of
and to hide heterogeneity and the details of the
dependability. It is referred to as the ability for a system
underlying system software, communication protocols,
to provide its specified service in spite of component
and hardware. Built-in mechanisms and policies for fault-
failure [2] [4]. Fault-tolerant systems behavior is
tolerant can be achieved by middleware and provide
predictable despite of partial failures, asynchrony, and
solutions to the problem of detecting and reacting to
run-time reconfiguration of the system. Moreover, fault-
partial failures and to network partitioning. Middleware
tolerant applications are highly available. The application
can render a reusable software layer that supports
can provide its essential services despite the failure of
standard interfaces and protocols to construct a fault-
computing nodes, software object crash, communication
tolerance distributed systems. Dependable middleware
network partition, value fault for applications [5].
shields the underlying distributed environments
However, building dependable distributed systems is
complexity by separating applications from explicit
complex and challenging. On the one hand, system


protocol handling, disjoint memories, data replication, assumes that the only way a component can fail is by
and facilitates the construction of dependable application stopping the delivery of messages and that its internal
[6]
. state is lost [2] [4].
The timing fault model assumes that a component will
III. FAULT TOLERANCE
time specification [2] [4]. A timing fault model can result in
A. Failure, Error and Fault events arriving too soon or too late. A timing fault model
includes delay and omission faults [2] [4]. A delay fault
In order to construct a dependable distributed system, it
occurs when the message has the right content but arrives
is important to understand the concepts of failure, error,
late [2] [4]. An omission fault occurs when no message is
and fault. In a distributed system, a failure occurs when
received. Sometimes, delay faults are called performance
the delivered service of a system or a component deviates
faults [2] [4]. In the value fault model, the value of
from its specification [4]. An error is that part of the
delivered service does not comply with the specification
system state that is liable to lead to subsequent failure. An [2] [4]
.
error affecting the service is an indication that a failure
Arbitrary fault model is the most general fault model,
occurs or has occurred [4]. A fault is the adjudged or
in which components can fail in an arbitrary way [2] [4]. As
hypothesized cause of an error [4].
a result, if arbitrary faults are considered, no restrictive
In general terms, we think that an error is the
assumption will be made [2] [4]. An arbitrarily faulty
manifestation of a fault in the distributed system, while a
component might even send contradictory messages to
failure is the effect of an error on the service. As a result,
different destinations (a so-called byzantine fault) [2] [4].
faults are potential sources of system failures.
This model can include all possible causes of fault, such
Whether or not an error will actually lead to a failure
as messages arriving too early or too late, messages with
depends on three major factors. One factor is the system
incorrect values, messages never sent at all, or malicious
composition, and especially the nature of the existing
faults [2] [4].
redundancy [4]. Another factor is the system activity. An
error may be overwritten before creating damage [4]. A
third factor is the definition of a failure from the users C. Error Processing and Fault Treatment
viewpoint. What is a failure for a given user may be a Fault tolerance is systems ability to continue to
bearable nuisance for another one [4]. provide service in spite of faults [2] [4]. It can be achieved
Faults and their sources are extremely diversified. by two main forms: error processing and fault treatment [2]
They can be categorized according to five main [4]
. The purpose of error processing is to remove errors
perspectives that are their phenomenological cause, their from the computational state before a failure occurs, if
nature, their phase of creation or of occurrence, their possible before failure occurrence, whereas the purpose
situation with respect to the system boundaries, and their of fault treatment is to prevent faults from being activated
persistence [4]. again [2] [4].
In error processing, error detection, error diagnosis,
B. Fault models and error recovery are commonly used approaches [2] [4].
Error detection and diagnosis is an approach that first
When designing a distributed fault-tolerant system, we
identifies an erroneous state in the system, and then
can not to tolerate all faults. As consequence, we must
assesses the damages caused by the detected error or by
define what types of faults the system is intended to
errors propagated before detection [2] [4]. After error
tolerate. The definition of the types of faults to tolerate is
detection and diagnosis, error recovery substitutes an
referred to as the fault model, which describes abstractly
error-free state for the erroneous state [2] [4].
the possible behaviors of faulty components [2] [4]. A
Error recovery may take on three forms: backward
system may not, and generally does not, always fail in the
recovery, forward recovery, and compensation [2] [4]. In
same way. The ways a system can fail are its fault modes.
backward recovery, the erroneous state transformation
As a result, the fault model is an assumption about how
consists of bringing the system back to a state already
components can fail [2] [4].
occupied prior to error occurrence [2] [4]. This entails the
In distributed systems, a fault model is characterized
establishment of recovery points, which are points in time
by component and communication failures [2] [4]. It is
during the execution of a process for which the then
common to acknowledge that communication failures can
current state may subsequently need to be restored [2] [4].In
only result in lost or delayed messages, since checksums
forward recovery, the erroneous state transformation
can be used to detect and discard garbled messages[2] [4].
consists of finding a new state, from which the system
However, duplicated or disordered messages are also
can operate [2] [4]. Error compensation renders enough
included in some models [2] [4].
redundancy so that a system is able to deliver an error-
For a component, the most commonly assumed fault
free service from the erroneous state [2] [4].
models are (in increasing order of generality): stopping
The goal of fault treatment determines the cause of
failures or crashes, timing fault model, value fault model
observed errors and prevents faults from being activated
and arbitrary fault model [2] [4]. Stopping failures or
again [2] [4]. The first step in fault treatment is fault
crashes is the simplest and most common assumption
diagnosis, which consists of determining the cause(s) of
about faulty components [2] [4]. This model always
error(s), in terms of both location and nature [2] [4]. Then it


takes actions aimed at making it (them) passive [2] [4]. [5]


. The major challenge of replication technique is to
This is achieved by preventing the component(s) maintain replica consistency [7] [8] [9]. Replication will fail
identified as being faulty from being invoked in further in its purpose if the replicas are not true copies of each
executions [2] [4]. Fault treatment can be used to other, both in state and in behavior [5] [10] [11] [12].
reconfigure a system to restore the level of redundancy so
that the system is able to tolerate further faults [2] [4].
B. Distributed Recovery
In a dependable distributed system, some form of
IV. DISTRIBUTED TOLERANCE TECHNIQUES
failed process or replica on the availability of a
distributed service [4]. In its simplest form, this can be just
A. Replication
a local recovery of the failed process or replica. However,
In order to mask the effects of faults, distributed fault distributed recovery will occurs if the recovery of one
tolerance always requires some form of redundancy. process or replica requires remote processes or replicas
Replication is a classic example of space redundancy. It also to undergo recovery [4]. In this case, processes or
exploits additional resources beyond what is needed for replica must rollback to a set of checkpoints that together
normal system operation to implement a distributed fault- constitute a consistent global state [4].
tolerant service [2] [4]. The metaphor of replication is to In order to create checkpoints, there are several major
manage the group of processes or replicas so as to mask approaches. One way is asynchronous checkpointing [4].
failures of some members of the group [2] [4]. By In asynchronous checkpointing, checkpoints are created
coordinating a group of components replicated on independently by each process or replica, and then when
different computing nodes, distributed systems can a failure occurs, a set of checkpoints must be found that
provide continuity of service in the presence of failed represents a consistent global state [4]. This approach aims
nodes [2] [4]. to minimize timing overheads during normal operation at
There are three well-known replication schemes: active the expense of a potentially large overhead when a global
replication, passive replication, and semi-active state is sought dynamically to perform the recovery [4].
replication. In active replication scheme, every replica The price to be paid for asynchronous checkpointing is
executes the same operations [2] [4]. Input messages are domino effect. If no other global consistent state can be
atomically multicasted to all replicas, who all process found, it might be necessary to roll all processes back to
them and update their internal states. All replicas generate the initial state [4]. As a result, in order to avoid the
output messages [2] [4]. domino effect, checkpoints can be taken in some
Passive replication is a technique in which only one of coordinated fashion.
the replicas (the primary) actively executes the operation, Another way is to structure process or replica
updates its internal state and sends output messages [2] [4]. interactions in conversations [4]. In a conversation,
The other replicas (the standby replicas) do not process processes or replicas can communicate freely between
input messages; however, their internal state must be themselves but not with other processes external to a
updated periodically by information sent by the primary [2] conversation [4]. If processes or replicas all take a
[4]
. If the primary should fail, one of the standby replicas checkpoint when entering or leaving a conversation,
is elected to take its place [2] [4]. recovery of one process or replica will only propagate to
Semi-active replication is a technique which is similar other processes or replica in the same conversation [4].
to active replication [2] [4]. In semi-active replication, all A third alternative is synchronous checkpointing [4] [13].
replicas will receive and process input messages. In this approach, dynamic checkpoint coordination is
However, unlike active replication, the processing of allowed so that a set of checkpoints can represent global
messages is asymmetric in that one replica (the leader) consistent states [4] [13]. As consequence, the domino effect
takes responsibility for certain decisions (e.g., concerning problem can be transparently avoided for the software
message acceptance) [2] [4]. The leader replica can enforce developers even if the processes or replicas are not
its choice on the other replicas (the followers) without deterministic [4]. At each instant, each process or replica
resorting to a consensus protocol [2] [4]. One alternative for possesses one or two checkpoints: a permanent
semi-active replication is that the leader replica may take checkpoint (constituting a global consistent state) and
sole responsibility for sending output messages [2] [4]. another temporary checkpoint [4]. The temporary
Semi-active replication primarily targeted at crash checkpoints may be undone or transformed into a
failures. However, under certain conditions, this strategy permanent checkpoint. The creation of temporary
can also be extended to deal with arbitrary or byzantine checkpoints, and their transformation into permanent
failures [2] [4]. ones, is coordinated by a two-phase commit protocol to
Continuity of service in the presence of failed nodes ensure that all permanent checkpoints effectively
requires replication of processes or objects on multiple constitute a global consistent state [4].
nodes [2] [4]. Replication can provide high-available
service for a dependable distributed system. By
replicating their constituent objects and distributing their C. Partitioning Tolerance
replicas across different computers connected by the A distributed system may partition into a finite number
network, distributed applications can be made dependable of components. The processes or replicas in different


components cannot communicate with each other [11]. However, building such a software infrastructure that
Partitioning may occur due to normal operations, such as achieves dependable goal is not an easy task. Neither the
in mobile computing, or due to failures of processes or standard nor conventional implementations of
inter-process communication. Performance failures due to middleware directly address complex problems related to
overload situations can cause ephemeral partitions that dependable computing, such as partial failures, detection
are difficult to distinguish from physical partitioning [4]. of and recovery from faults, network partitioning, real-
Partitioning is a very real concern and a common event time quality of service or high-speed performance, group
in wide area networks [4]. If the network partitions, communication, and causal ordering of events[9]. In order
different operations may be performed on the processes to cope with these limitations, many research efforts have
or replicas in different components, leading to been focused on designing new middleware systems
inconsistencies that must be resolved when capable of supporting the requirements imposed by
communication is re-established and the components dependability [5].
remerge [5]. One strategy for achieving this is to allow A first issue that needs to be addressed by dependable
components of a partition to continue some form of middleware is interoperability [2]. Interoperability allows
operation until the components can re-merge [4] [11]. Once different software systems to exchange data via a
the components of a partitioned remerge, the processes or common set of exchange formats, to read and write the
replicas in the merged components must communicate same file formats, and to use the same protocols. As a
their states, perform state transfer and reach a global result, in order to be useful, dependable middleware
consistent state [5]. should be interoperable [2]. Through interoperability,
As another example, certain distributed fault- dependable middleware can provide a platform-
tolerance techniques are aimed at adopting dynamic independent way for applications to interact with each
linear voting protocol to ensure replica consistency in other [2]. In other words, two systems running on the
partitioned networks [5]. Voting protocols are based on different middleware platforms can interoperate with
quorums. In voting protocols, each node is assigned a each other even when implemented in different
number of votes. When a network is partitioned or programming languages, operating systems, or hardware
remerged, if a majority of the last installed quorum is facilities [2].
connected, a new quorum is established and updates can Another important problem concerns transparency.
be performed within this partition [5]. Dependable middleware should provide some form of
transparency to applications [2]. It allows dynamically to
add to an existing distributed application and to interfere
V. DEPENDABLE MIDDLEWARE
Therefore, many existing applications can benefit from
In the past decade, middleware has emerged as a major the dependable middleware [2]. Traditional middleware is
building block in supporting the construction of built adhering to the metaphor of the black box.
distributed applications [14]. The development of Application developers do not have to deal explicitly with
distributed applications has been greatly enhanced by problems introduced by distribution. Middleware
middleware. Middleware provides application developers developed upon network operating systems provides
with a reusable software layer that relieve them from application developers with a higher level of abstraction.
dealing with frequently encountered problems related to The infrastructures diversities are hidden from both users
distribution, such as heterogeneity, interoperability, and application developers, so that the system appears as
security, scalability, and so on[14][15][16][17]. a single integrated computing facility [16].
Implementation details are encapsulated inside the Although transparency philosophy has been proved
middleware itself and are shielded from both users and successful in supporting the construction of traditional
application developers, so that the infrastructures distributed systems, it cannot be used as the guiding
diversities are homogenized by middleware [18] [19] [20] [21]. principle to develop the new abstractions and
These attractive features have made middleware an mechanisms needed by dependable middleware to foster
important architectural component in the distributed the development of dependable distributed systems when
system development practice. Further, with applications applied to the todays computing settings[15][18][19]. As a
becoming increasingly distributed and complex, result, it is important to adopt an open implementation
middleware appears as a powerful tool for the approach to the engineering of dependable middleware
development of software systems [14]. platforms in terms of allowing inspection and adaptation
Recently, a strong incentive has been given to research of underlying components at runtime[22][23][24][25].
community to develop middleware to provide fault With networks becoming increasingly pervasive, major
tolerance to distributed applications [2]. Middleware system requirements posed by todays networking
support for the construction of dependable distributed infrastructure relate to openness and context-awareness
systems has the potential to relieve application developers [14]
. This leads to investigate new approaches for
from the burden by making development process faster middleware with support for dependability and context-
and easier and significantly enhancing software reuse. aware adaptability. However, in order to provide
Hence, such middleware can render building blocks to be transparency, traditional middleware must make
exploited by applications for enforcing dependability decisions on behalf of the application. This is inevitably
property [2].


done using built-in mechanisms and policies that cater for generation and to construct smart measurement system,
the common case rather than the high levels of demand-side response, and distribution automation to
dynamicity and heterogeneity intrinsic in todays transmission grid intelligence [33][34][35][36][37][38]. With the
networking environments[16][19]. The application, however, advent of smart grid era, electric power systems are
can normally make more efficient and better quality confronted with new challenges. The diversity and scale
decisions based on application-specific information that of networking environments and application domains has
could enable the middleware to execute more efficiently, made smart grid and its association with applications
in different contexts[16][19]. As such, it is not appropriate to highly distributed and complex. Future smart grid
place the whole responsibility for adaptation on the applications are expected to operate in environments that
dependable middleware [16] [19]. Dependable middleware, are highly distributed and dynamic with respect to
instead, may interact with the application, making the resource availability and network. As consequence, the
application aware of execution context changes and likelihood of faults undoubtedly increases. This gives a
dynamically tuning its own behavior using valuable strong incentive to researchers and engineering
application information [16] [19]. practioners to investigate the construction of dependable
Reflection offers significant advantages for building distributed systems. Further, it leads to leave much work
dependable middleware in the todays computing settings to be done before smart grid technology is fully enabled.
[26] [27] [28] [29] [30]
. Reflection is a principled technique In this paper, we attempt to explore some key issues
supporting both inspection and adaptation [26] [27] [28] [29] [30]. related to building a dependable distributed system for
A reflective dependable middleware system can bring smart grid application. In particular, we focus on the
modifications to itself by means of inspection and dependability matters, major challenges and distributed
adaptation [26] [27] [28] [29] [30]. On the one hand, through tolerance techniques for the construction of dependable
inspection, the internal behavior of dependable systems. Still, we also look deeper into the systematic
middleware is exposed, so that it becomes approach to providing the dependable support for smart
straightforward to insert additional behavior to monitor grid applications. In addition, we introduce basic
the middleware implementation. On the other hand, concepts and techniques for fault-tolerance. While some
through adaptation, the internal behavior of dependable of the insights might seem rather intuitive in hindsight,
middleware can be dynamically tuned, by modifications we think that these views are often sadly neglected in the
of existing features or by adding new ones [26] [27] [28] [29] [30]. development of dependable distributed applications. It is
As consequence, the reflection technique can support our sincere hope that dependable middleware
more open and configurable dependable middleware. implementers and application developers will benefit
Reflection mechanism can enable dependable middleware from our knowledge and contributions and that our
systems to inspect or change the way the underlying insights will help to shape the future of dependable
environment processes the application [31] [32]. Through infrastructures for middleware and distributed
reflection mechanism, dependable middleware systems applications.
can acquire information about their execution context and Although existing research efforts have addressed
adapt their behaviors accordingly [31] [32]. some issues related to the construction of the dependable
In addition to being application-transparent, distributed systems for smart grid, many issues require
dependable middleware also needs to provide a simple further investigation. Some open issues, such as
interface that allows applications to specify desires about combining dependability and real-time, combining fault
the dependability and to provide automatic detection of tolerance and security, combining replication and group
and recovery from faults [2]. Besides, when the communication, combining legacy applications and
dependability requirements dynamically change at dependable middleware, still remain to be addressed by
runtime, the dependability mechanisms may change the developers of dependable distributed systems for
during the execution of an application [2]. Therefore, smart grid[5][11].
dynamic reconfigurability is also required in the
dependable middleware. Dynamic reconfigurability can
be achieved by adding a new behavior or changing an ACKNOWLEDGMENT
existing one at system runtime. Dependable middleware
The authors would like to acknowledge the anonymous
capable of supporting dynamic reconfigurability needs to
detect changes in dependability requirements and the reviewers. This work has been partially funded by
faults that occur in the environment, and reallocate National 11-5th High-Tech Support Program of China
(2006BAH02A0407) and key Technology Support
resources, or notify the application to adapt to the
Program of the State Grid of China (WG1-2010-X).
changes [2].

REFERENCES
VI. CONCLUSION AND FUTURE WORK
[1] [1] Qilin Li, Wei Zhen, Minyi Wang, Mingtian Zhou, Jun
Over the few years, increasing attention has been given He, Researches on key issues of mobile middleware
to smart grid. As a new paradigm in the power grid, smart technology, Proceedings of the 2008 International
grid exploits the latest information and communication Conference on Embedded Software and Systems Symposia
technologies to accommodate renewable energy


ICESS2008 Chengdu, China, July 2008, IEEE System for Mobile Applications, IEEE Transactions on
Computer Society, pp. 333 - 338 Software Engineering, 29 (10). Pp.929-945
[2] [2] J. Ren, AQuA: A Framework for Providing Adaptive [20] Licia Capra,Gordon S.Blair,Cecilia Mascolo, Exploiting
Fault Tolerance to Distributed Applications, PhD thesis, reflection in mobile computing middleware , ACM
University of Illinois at Urbana-Champaign, 2001 SIGMOBILE Mobile Computing and Communications
[3] V [3] Valerie Issarny, Mauro Caporuscio, Nikolaos Review,2002,10,6(4):pp.3444
Georgantas, A Perspective on the Future of Middleware- [21] F. Kon, F. Costa, G. Blair, et al, The case for reflective
based Software Engineering, Future of Software
middleware , Communications of ACM, 2002, 45(6):
Engineering 2007, L. Briand and A. Wolf edition, IEEE-
CS Press. 2007 pp.3338
[4] C.Laprie, editor, Dependability: Basic Concepts and [22] Smith B., Reflection and Semantics in a Procedural
Terminology, Springer-Verlag, Vienna, 1992 Programming Language. PhD thesis Jan. 1982, MIT Press
[5] Qilin Li, Wei Zhen, Mingtian Zhou Middleware for [23] P. Maes. Concepts and Experiments in Computational
Dependable Computing, Proceedings of the 2008 Reflection. In Norman K. Meyrowitz, editor, Proceedings
International Conference on Embedded Software and of the 2nd Conference on Object-Oriented Programming
Systems Symposia ICESS2008 Chengdu, China, Systems, Languages, and Applications (OOPSLA87),
volume 22 of Sigplan Notices, pages 147156, Orlando,
July 2008, IEEE Computer Societypp.296-301
Florida, USA, October 1987. ACM
[6] Kurt Geihs, Middleware challenges ahead, IEEE [24] Yang Sizhong, Liu Jinde, Luo Zhigang., RECOM: A
Computer, June 2001, 34(6): 2431 Reflective Architecture of Middleware, Proceedings of
[7] S.Krishnamurthy, An Adaptive Quality of Service Aware International Conferences on Info.-tech.and Info.-net.,
Middleware for Replicated Services, PhD thesis, Beijing, October, 29, 2001, pp.339-344
University of Illinois at Urbana-Champaign, 2002 [25] W. Cazzola, et al, Architectural Reflection: Bridging the
[8] P.Narasimhan, Transparent Fault Tolerance for CORBA, Gap Between a Running System and its Architectural
PhD thesis, University of California at Santa-Barbara, Specification, in proceedings of 6th Reengineering Forum
1999 (REF'98), Firenze, Italy: IEEE. 1998
[9] S.Maffeis, D.C.Schmidt, Constructing Reliable [26] P. Maes, Computational Reflection, PhD, Vrije
Distributed Systems with CORBA, IEEE Universiteit Brussels, 1987
Communications Magazine, 35(2): pp.56-60, Feb.1997 [27] W.Cazzola, Evaluation of object-oriented reflective
[10] P.Felber, The CORBA Object Service: a Service model, In Proceedings of ECOOP Workshop on
Approach to Object Groups in CORBA, PhD thesis, Reflective Object-Oriented Programming and Systems
Swiss Federal Institute of Technology at Lausanne, (EWROOPS'98), Brussels, Belgium, Jul.1998
Switzerland, 1998 [28] G. Kiczales, Beyond the Black Box: Open
[11] P.Felber, P.Narasimhan, Experiences, Strategies, and Implementation, in IEEE Software. p. 8-11. 1996
Challenges in building Fault-Tolerant CORBA Systems, [29] G. Kiczales, J.D. Rivieres, and D. Bobrow, The Art of the
IEEE Transactions on computers, 53(5):pp.497-511, Metaobject Protocol: MIT Press. 1991
May.2004 [30] T. Schfer, "Supporting Metatypes in a compiled,
[12] B.Natarajan, A.Gokhale, S.Yajnik, D.C.Schmidt, DOORS: reflective programming language, PhD thesis, Dept. of
Towards High-performance Fault Tolerance CORBA, in Computer Science, Trinity College Dublin, Dublin, 131.
Proceedings of the 2nd Distributed Applications and 2001
Objects(DOA) conference, Antwerp, Belgium, Sep. 21-23, [31] J. Dowling, V. Cahill, The K-Component Architecture
2000 Meta-Model for Self-Adaptive Software, Proceedings of
[13] E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, Reflection 2001, LNCS 2192, 2001
and David B. Johnson, "A Survey of Rollback-Recovery [32] John Keeney and Vinny Cahill, Chisel: A Policy-Driven,
Protocols in Message-Passing Systems", in ACM Context-Aware, Dynamic Adaptation Framework,
Computing Surveys, 34(3):pp.375-408, September 2002 Proceedings of the 4th IEEE International Workshop on
[14] Valerie Issarny, Mauro Caporuscio, Nikolaos Georgantas, Policies for Distributed Systems and Networks (Policy
A Perspective on the Future of Middleware-based 2003), Lake Como, Italy, 2003, pp. 314
Software Engineering, Future of Software Engineering
[33] Zhong JinZheng Rui-minYang Wei-hongFelix Wu,
2007, L. Briand and A. Wolf edition, IEEE-CS Press. 2007
[15] G.S. Blair, G.Coulson, P.Robin, M..Papathomas, An Construction of Smart Grid at Information Age ,
Architecture for Next Generation Middleware, Power System Technology, 2009 33 (13). Pp.12-18(in
Proceedings of the IFIP International Conference on Chinese)
Distributed Systems Platforms and Open Distributed [34] Research Reports International, Understanding the smart
Processing (Middleware'98), The Lake District, UK, pp. grid, RRI00026
191-206, 15-18 September 1998 [35] The National Energy Technology Laboratory, Modern
[16] Licia Capra, Wolfgang Emmerich, Cecilia Mascolo, grid benefits, Pitt sburgh , PA , USA : NETL , 2007
Middleware for Mobile Computing, UCL Research Note [36] The Electricity Advisory Committee, Smart grid
RN/30/01, Submitted for publication, July 2001 Enabler of the new energy economy[EB/OL], 2008-12-
[17] Abdulbaset Gaddah, Thomas Kunz, A survey of 01[2009-04-20]
middleware paradigms for mobile computing, Department [37] Jing PingGuo Jian-bo Zhao Bo Zhou FeiWang
of Systems and Computer Engineering Carleton University, Zhi-bing, Applications of Power Electronic
Tech Rep:SCE-03-16, 2003 Technologies in Smart Grid, Power System Technology,
[18] Guanling Chen, David Kotz, A Survey of Context-Aware
200933 (15). Pp.1-6(in Chinese)
Mobile Computing Research, Dartmouth Computer
Science Technical Report TR2000-381, 2000 [38] Zhang Wen-liangLiu Zhuang-zhiWang Ming-jun
[19] Capra, L. and Emmerich, W. and Mascolo, C. (2003) Yang Xu-sheng, Research Status and Development
CARISMA: Context-Aware Reflective mIddleware


Trend of Smart Grid , Power System Technology,


200933 (13). Pp.1-11(in Chinese) ZHOU Mingtian was born in 1939, Guangxi Province,
China. He received EEBS degree from Harbin Institute of
Technology, Harbin, China, in 1962. He became a faculty with
UESTC in 1962. He is now a professor and doctoral supervisor
of College of Computer Science and Engineering, UESTC,
Senior Member of IEEE, Fellow of CIE, Senior Member of
FCC, Member of Editorial Board of <Acta Electronica Sinica>
and <Chinese Journal of Electronics> and TC Member of
LI Qilin was born in 1973, Chongqing City, China. He
Academic Committee of State Council. He has published 13
received the PhD degree in computer science from University of
books and 260 papers. His research interests include Network
Electronic Science and Technology of China (UESTC), in 2006.
Computing, Computer Network, Middleware Technology, and
He is now vice director in production and technology
Network and Information System Security.
department of Sichuan Electric Power Science and Research
Institute. His research interests include Smart Grid, Electric
Power Automation Dependable Distributed Middleware
Systems, and Multi-Agent Cooperation Systems.


The Application of SPSS Factor Analysis in the Evaluation of Corporate Social Responsibility
Hongming Chen
Department of Economics and Management, Changsha University of Science and Technology, Changsha, China
Email: chmdsh@163.com

Xiaocan Xiao
Department of Economics and Management, Changsha University of Science and Technology, Changsha, China
Email: adashaw89@gmail.com

AbstractAccording to the basic idea and the model theory


of factor analysis, this thesis explores the applicability of II. PRINCIPLE OF SPSS FACTOR ANALYSIS
factor analysis in the evaluation of corporate social MODEL
responsibility. With the detailed examples of the applying of
SPSS factor analysis showed in this paper, it is showed the Factor analysis is a multivariate statistic method which
rationality of SPSS factor analysis being applied to social starts from the research related to the dependence of the
responsibility assessment of thermal power corporate. By internal variables, and concludes the numerous complex
essentially analyzing and evaluating the results of SPSS variables into a few comprehensive factors. The basic
software running out, the most simple and suitable standard
idea of factor analysis is that through the study of the
evaluation model of corporate social responsibility of
thermal power is determined deservedly, and at last this internal structure of variable correlation matrix or
thesis puts forward the prospects of solving the covariance matrix, a few random variables which are
shortcomings of the model. used to describe the relationship between multiple
variables must be found out. Then group the variables
Index TermsSPSS, factor analysis, corporate social according to the level of the relevance among them, and
responsibility make sure the correlation between the variables in a same
group is high and the variable between groups is low. So
I. INTRODUCTION each type of variable actually represents a basic structure,
The Statistical Product and Service Solutions (SPSS), namely common factor. These common factors are the
statistical software developed by SPSS Company of basic structures which influence the correlation between
America, has the capabilities of basic and advanced the original variables. In social statistics, it is necessary to
statistics. It is well known as professional statistical identify and summarize some main factors from the
software with widely applications in many fields, such as internal of complex realities, and the characteristics of
communication, medical treatment, bank, security, these common factors are to help us fully grasps what as
insurance, manufactory, commerce, market research, they are and find out the laws.
scientific research and education, etc. By studying the In the factor analysis, when the common factor is not
common and special reasons of variables change, factor obvious, the factor rotation method must be used to get
analysis simplifies variables structures. It is a some unrelated common factors in order to make such
multivariate statistical method derived from educational common factors more decentralized. That means the first
psychology. Recent years witnessed the booming of common factor is representative of one part of the
factor analysis in several fields, such as psychology, variables, and so does the second common factor, and so
medical science, meteorology, and economics, etc. By on. Such dealing makes sure that each common factor has
analyzing the dependency of related matrixes of the obvious practical implications, which is conductive to
source variables, factor analysis transfers variables in analysis and interpretation. So factor analysis is a method
complex relationships into a few integrated factors in the which tries to use the minimum number of unpredictable
principle of dimensionality reduction. Against the factors in the so-called linear function of the common and
background of various evaluation mechanisms of special factors and to describe the original observations
corporate social responsibility, this thesis describes how of each component. It divides each original variable into
to use SPSS to perform factor analysis on the fulfillment two parts: one consists of a few factors which are shared
of corporate social responsibility, and represents the by all variables, namely common factor; the other part is
formula to calculate the standard value of common only one variable impact, for a specific variable, namely
factors based on the analysis results. At last, the thesis the special factor. If the special factor is zero, then it is
establishes a comprehensive evaluation model of called the main component analysis.
corporate social responsibility for heat-engine plants. Generally, the Factor analysis model is:


X_1 = a_{11}F_1 + \dots + a_{1n}F_n + e_1
X_2 = a_{21}F_1 + \dots + a_{2n}F_n + e_2
\dots
X_m = a_{m1}F_1 + \dots + a_{mn}F_n + e_m

In the above formulas, X_i (i = 1, 2, ..., m) is a measured variable; a_{ij} (i = 1, 2, ..., m; j = 1, 2, ..., n) is a factor loading; F_j (j = 1, 2, ..., n) is a common factor; and e_i (i = 1, 2, ..., m) is a specific factor. A factor loading is the correlation coefficient between the i-th variable and the j-th factor, and it reflects the importance of the i-th variable for the j-th factor: if the loading is large, the i-th variable is closely related to the j-th factor; otherwise, the two are only loosely related. In the factor model, the common factors are uncorrelated with one another, the specific factors are uncorrelated with one another, and the specific factors are uncorrelated with the common factors. Thus, starting from a group of original observed variables, factor analysis, as a statistical method for finding latent factors, analyzes the common factors and the specific factors, finds the corresponding loading matrix, and then explains the meaning of every common factor.
The first task of factor analysis is to construct a factor factor to the linear combination, so that each variable in
model with the model parameters determined, and to only one common factor on a greater load, whereas in the
explain the results of factor analysis; the second task of rest of the public factor on the load, or medium size, in
factor analysis is to estimate the common factors, and order to find the common factor of the more clear
analysis them to a further extend. Therefore, surrounding economic meaning .You can create a new set of common
this core problem, the basic steps and solution idea of factors following a linear combination, , and these
factor analysis is clear. Factor analysis often has the factors are independent of each other, but also can well
following basic steps: explain the relationship between the original variables.
1) Confirm whether the original variables are F 1 = d 11F 1 + d 12 F 2 + + d 1 pFp
suitable for the factor analysis, namely inspection steps.
The determination method used in this paper is mainly F 2 = d 21F 1 + d 22 F 2 + + d 2 pFp
Bartlett Test of Sphericity and KMO (Kaiser-Meyer- "
Olkin). The tested statistic of Bartlett Test of Sphericity
derives from related coefficient matrix. If the value is F p = dp1F 1 + dp 2 F 2 + + dppFp
great, and the corresponding companion probability value In fact the so-called "rotation" in factor analysis is the
is less than the established level of significance, there is a distribution of variable information once again, which is
correlation between the original variables. Tested statistic like that adjusting the focus of a microscope, in order to
of KMO is used to compare simple correlation and partial see the fine things. It is characterized not to increase or
correlation relationship between variables, and the value decrease the amount of information of things, only
is between 0-1. The more closely it is to 1, the more through the appropriate adjustment made things as they
suitable the variables for factors analysis. And all that are clearer.
simple correlation squares far outweigh the partial 4) Calculate the factor variable scores.
correlation square between the variables. Factor score, is the score of common factor scores in
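These two inspection statistics can also be computed outside SPSS. The following is a minimal sketch, assuming the open-source Python package factor_analyzer and a placeholder data file of the index values; it only illustrates the tests and is not the computation actually performed in the paper.

    import pandas as pd
    from factor_analyzer.factor_analyzer import (
        calculate_bartlett_sphericity,
        calculate_kmo,
    )

    # Hypothetical layout: one row per enterprise-year, columns X1..X10.
    data = pd.read_csv("csr_indexes.csv")   # placeholder file name

    # Bartlett's Test of Sphericity: a large chi-square with p below the
    # significance level indicates correlation among the original variables.
    chi_square, p_value = calculate_bartlett_sphericity(data)

    # KMO: values close to 1 mean the partial correlations are small relative
    # to the simple correlations, so factor analysis is suitable.
    kmo_per_variable, kmo_total = calculate_kmo(data)

    print(f"Bartlett chi-square = {chi_square:.3f}, p = {p_value:.4f}")
    print(f"Overall KMO = {kmo_total:.3f}")  # the paper reports 0.704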
2) Build the factor variables, namely extract the common factors.
Standardize the original variable data, calculate their correlation matrix, and analyze the correlation between the variables. The number of common factors is determined by requiring that the selected common factors retain a suitable proportion of the information content of the original indexes. The proportion of the total information of the original data retained by the i-th common factor, namely λi / (λ1 + λ2 + … ), is the contribution rate of the i-th common factor to the original data. The number of common factors is selected according to the size of these contributions: when the cumulative contribution rate exceeds a predetermined standard, enough factors have been taken.
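A small sketch of this step, using only the correlation matrix (file and variable names are placeholders, not the paper's own code): the eigenvalues give the contribution rate of each candidate factor, and the eigenvalue-greater-than-1 rule applied later in Section IV selects the number of common factors.

    import numpy as np
    import pandas as pd

    data = pd.read_csv("csr_indexes.csv")            # placeholder file name
    corr = data.corr()                               # correlation matrix of X1..X10

    # Eigenvalues of the correlation matrix, sorted in descending order.
    eigenvalues = np.sort(np.linalg.eigvalsh(corr.values))[::-1]

    contribution = eigenvalues / eigenvalues.sum()   # contribution rate of each factor
    cumulative = np.cumsum(contribution)             # cumulative contribution rate

    n_factors = int((eigenvalues > 1).sum())         # "eigenvalue > 1" criterion
    print("eigenvalues:", np.round(eigenvalues, 3))
    print("cumulative contribution:", np.round(cumulative, 3))
    print("number of common factors:", n_factors)    # the paper obtains 3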
3) Use the rotation method to make the factor variables easier to interpret, namely rotation.
The factor analysis model is established not only to find the common factors but, more importantly, to learn the meaning of each common factor, and to evaluate and analyze the operating results of the listed sample companies. Whatever method is used to determine the initial factor loading matrix A, it is not unique; it is only an initial value. If the representatives of the original indicators can be identified well from the initial factors, we can give these factors a reasonable economic interpretation and proceed to the next step of the analysis. But if the factor loadings are relatively average, it is difficult to distinguish which index is more closely linked to which factor: in the initial factor solution the main factors are not prominent typical representatives of the variables, the meaning of the factors becomes ambiguous, and the reasons why the evaluated objects differ in their factor scores cannot be traced back to the original indexes, which makes the analysis difficult. Factor rotation is then needed. To facilitate the interpretation of the meaning of the common factors, linear combinations of the initial main factors are formed so that each variable has a large loading on only one common factor and small or medium loadings on the remaining factors; in this way common factors with a clearer economic meaning are obtained. A new set of common factors can be created by the following linear combinations; these factors are independent of each other and can still explain the relationships between the original variables well:

F1' = d11F1 + d12F2 + … + d1pFp
F2' = d21F1 + d22F2 + … + d2pFp
……
Fp' = dp1F1 + dp2F2 + … + dppFp

In fact, the so-called "rotation" in factor analysis is a redistribution of the information of the variables; it is like adjusting the focus of a microscope in order to see fine details. It neither increases nor decreases the amount of information; through an appropriate adjustment it only makes things clearer as they are.

4) Calculate the factor variable scores.
The factor score is the score of each common factor at every sample point. A linear expression for each common factor in terms of the original variables is needed; substituting the values of the original variables into this expression, the factor scores can be worked out easily. Regression equations are established that use the common factors as the dependent variables and the original variables as the independent variables, as follows:

Fj = βj1X1 + βj2X2 + … + βjnXn,  j = 1, 2, …, p

Under the least squares criterion, the estimated values of the common factors F can be obtained as F = A'R^(-1)X. In this formula, A is the rotated factor loading matrix, A' is the transpose of the rotated factor loading matrix, R is the correlation matrix of the original variables, R^(-1) is the inverse matrix of R, and X is the vector of the original variables.
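Steps 2) to 4) together can be sketched as follows, again assuming the factor_analyzer package; method="principal" and rotation="varimax" mirror the SPSS options chosen in Section IV, and transform() returns regression-type factor scores in the spirit of F = A'R^(-1)X. The DataFrame and file name are placeholders, not the paper's own code.

    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    data = pd.read_csv("csr_indexes.csv")            # placeholder file name

    # Principal-component extraction of 3 factors followed by varimax rotation.
    fa = FactorAnalyzer(n_factors=3, method="principal", rotation="varimax")
    fa.fit(data)

    loadings = pd.DataFrame(
        fa.loadings_, index=data.columns, columns=["F1", "F2", "F3"]
    )  # counterpart of the rotated component matrix (Table III)

    ss_loadings, proportion, cumulative = fa.get_factor_variance()
    print(loadings.round(3))
    print("variance contribution:", proportion.round(5))
    print("cumulative contribution:", cumulative.round(5))

    # Regression-method factor scores for every sample point (enterprise-year).
    scores = fa.transform(data)                      # shape: (n_samples, 3)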

Comprehensive evaluation score. Let Wi (i = 1, 2, …, m) be the comprehensive evaluation score of a sample enterprise. The comprehensive evaluation equation of the samples is then obtained as the weighted sum of the factors, as follows:

Wi = α1F1 + α2F2 + … + αpFp

where αi is the weight of each factor, equal to the contribution of the i-th common factor divided by the total contribution of the p common factors, namely αi = λi / (λ1 + λ2 + … + λp).
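A minimal sketch of this weighting (variable names are placeholders; the contribution rates shown are the rotated values that appear later in Table II):

    import numpy as np

    def comprehensive_scores(factor_scores, variance_contribution):
        """W_i = sum_j alpha_j * F_ij, with alpha_j = lambda_j / sum_k lambda_k."""
        contribution = np.asarray(variance_contribution, dtype=float)
        weights = contribution / contribution.sum()      # the alpha_j weights
        return np.asarray(factor_scores, dtype=float) @ weights

    # Rotated variance contributions of the three factors (see Table II below):
    alphas_input = np.array([0.43385, 0.20473, 0.18470])
    # comprehensive_scores(scores, alphas_input) returns one W value per sample,
    # where `scores` is the (n_samples x 3) factor-score matrix from the previous sketch.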


III. EXPLORATION OF THE APPLICABILITY OF SPSS FACTOR ANALYSIS IN CORPORATE SOCIAL RESPONSIBILITY EVALUATION

In theory, factor analysis is well suited to the evaluation of corporate social responsibility. First, because of the diversity and complexity of the performance of corporate social responsibility and the strong correlation between its various aspects, it is difficult to determine the evaluation indexes of social responsibility. Secondly, the universality of social responsibility requires that as many evaluation indexes as possible be established, while establishing a large number of evaluation indexes inevitably costs a great deal, which finally makes the evaluation a mere formality. Third, the determination of the weighting values of the indicators greatly influences the evaluation results. On the one hand, it may cause the object of study to be evaluated too high or too low, so that the final results of the evaluation cannot truly reflect the actual level; on the other hand, the evaluated objects will probably pursue only the indexes with higher weights, or manipulate data and inflate some evaluation indexes. Factor analysis, with its features, is a good solution to the problems mentioned above, and in dealing with a social responsibility problem with multiple variables and indexes the factor analysis method has its own incomparable advantages:

1) Factor analysis is one of the multivariate statistical methods of social statistics, and it can handle corporate social responsibility problems that are multi-dimensional, multi-variable and complex. By classifying the complex indexes, it finds the common factors on which the actual diagnosis, analysis and comprehensive evaluation are carried out.

2) The factor analysis method simplifies the analysis process by classifying the objects. It can resolve the complex relationships between indexes and decrease the evaluation costs caused by numerous indexes.

3) The factor analysis method can avoid the defects caused by subjective determination. In the process of determining the index weights, the indexes are calculated according to fixed procedures rather than by subjective experience, so by using the factor analysis method the effect of subjective factors is avoided, the index weights are determined scientifically, and the evaluation index system of corporate social responsibility is constructed reasonably; thus social responsibility can be evaluated objectively and scientifically.

4) The factor analysis can be carried out by computer statistical analysis software, which gives it strong feasibility and operability. Using the statistical software SPSS 17.0, the index data typed in beforehand can be handled by factor analysis, including statistical inspection, extraction of the common factors and output of the data results, and the calculation process is simple, convenient and easy to operate.

Based on the principle and characteristics of factor analysis, it is therefore reasonable to apply factor analysis to evaluating corporate social responsibility. The factors that affect the social responsibility performance of an enterprise are so numerous and complicated that it is neither possible nor necessary to embrace all of them. Owing to the limitations of existing research and the difficulty of acquiring data, this article takes only ten financial indexes to construct the evaluation model. The important influence factors are then chosen and the sequence of impact factors is determined through the dimension-reducing processing of factor analysis, after which the factor score model is used to evaluate the samples comprehensively. At the same time, according to the factor analysis, the main factors and the secondary factors are distinguished from each other.

IV. PRACTICAL APPLICATION OF SPSS FACTOR ANALYSIS

A. The computer operating process
1) Input the 3 years of data of the 10 indexes from the 10 sample enterprises into the data table in SPSS (a scripted sketch of this data preparation appears after the list of steps below).
2) Click the menu items in the sequence Analyze > Data Reduction > Factor to open the Factor dialog box.
3) In the Factor Analysis dialog box, select the variables X1-X10 as the analytical variables.
4) Click the Descriptives button, select Univariate descriptives in the Statistics option, then select Coefficients, Significance levels, and KMO and Bartlett's test of sphericity in the Correlation Matrix option, and click Continue to return to the main dialog box.
5) Click the Extraction button, select Principal components in Method, select Correlation Matrix in the Analyze option, then select Unrotated factor solution and Scree plot in the Display option, and click Continue to return to the main dialog box.
6) Click the Rotation button, select the Varimax method for the rotation of the factor loadings, select Rotated solution in the Display option, and click Continue to return to the dialog box.

7) Click the Scores button, select Display factor score coefficient matrix, and click Continue to return to the main dialog box.
8) Click the OK button in the dialog box to obtain the results, which are shown in the following tables.
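The scripted data-preparation sketch referred to in step 1 above is given below. Because the ten indexes have different dimensions, the raw values are standardized before the analysis, as also noted in Section IV.B; file and column names are hypothetical.

    import pandas as pd

    # Hypothetical layout: one row per enterprise-year, columns X1..X10 plus labels.
    raw = pd.read_csv("csr_indexes.csv")
    index_cols = [f"X{i}" for i in range(1, 11)]

    # Z-score standardization removes the effect of the different index dimensions.
    standardized = (raw[index_cols] - raw[index_cols].mean()) / raw[index_cols].std(ddof=1)

    data = standardized                    # input used by the earlier sketches
    print(data.describe().round(3))        # each column now has mean 0 and std 1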
TABLE I. KMO AND BARTLETT'S TEST

  Kaiser-Meyer-Olkin Measure of Sampling Adequacy            .704
  Bartlett's Test of Sphericity    Approx. Chi-Square     331.855
                                   df                          45
                                   Sig.                      .000

TABLE II. TOTAL VARIANCE EXPLAINED

              Extraction sums of squared loadings     Rotation sums of squared loadings
  Component   Total   % of variance   Cumulative %    Total   % of variance   Cumulative %
  1           5.026   50.265          50.265          4.339   43.385          43.385
  2           1.891   18.910          69.175          2.047   20.473          63.858
  3           1.315   13.154          82.329          1.847   18.470          82.329

TABLE III. ROTATED COMPONENT MATRIX

        Component 1   Component 2   Component 3
  X1       .979         -.152         -.023
  X2       .970         -.146         -.008
  X3       .945         -.229         -.095
  X4       .932         -.229         -.039
  X5      -.235          .878          .001
  X6      -.114          .861          .389
  X7       .557         -.610          .364
  X8       .266          .062          .769
  X9      -.104          .043          .730
  X10     -.467          .096          .655

TABLE IV. COMPONENT SCORE COEFFICIENT MATRIX

        Component 1   Component 2   Component 3
  X1       .259          .103         -.004
  X2       .257          .104          .004
  X3       .233          .053         -.038
  X4       .229          .045         -.006
  X5       .108          .513         -.070
  X6       .139          .495          .146
  X7       .040         -.306          .249
  X8       .088          .031          .420
  X9      -.025         -.053          .401
  X10     -.121         -.086          .357

TABLE V. 10 SAMPLE ENTERPRISES' AVERAGE F1 SCORES RANKING IN 3 YEARS

  Enterprise                      F1          Rank
  Shenzhen Energy                  1.563773   1
  Guangdong Power A                0.454057   3
  SDIC Huajing Power               0.117747   5
  SP Power Development             0.243617   4
  Datang Power                    -0.02443    6
  Anhui Power                     -0.36465    7
  Shenneng Stock                   0.637323   2
  Huadian Energy                  -0.94325    10
  Huaneng Power International     -0.79174    8
  Huadian Power International     -0.89245    9

TABLE VI. 10 SAMPLE ENTERPRISES' AVERAGE F2 SCORES RANKING IN 3 YEARS

  Enterprise                      F2          Rank
  Shenzhen Energy                 -0.29596    7
  Guangdong Power A                1.645917   1
  SDIC Huajing Power              -0.20985    5
  SP Power Development             0.43984    4
  Datang Power                    -0.21498    6
  Anhui Power                      0.96417    2
  Shenneng Stock                  -1.9632     10
  Huadian Energy                   0.60599    3
  Huaneng Power International     -0.60308    9
  Huadian Power International     -0.36884    8

TABLE VII. 10 SAMPLE ENTERPRISES' AVERAGE F3 SCORES RANKING IN 3 YEARS

  Enterprise                      F3          Rank
  Shenzhen Energy                  0.5514     3
  Guangdong Power A               -0.34414    7
  SDIC Huajing Power               1.73815    1
  SP Power Development             0.587587   2
  Datang Power                    -0.15524    4
  Anhui Power                     -0.90722    10
  Shenneng Stock                  -0.37384    8
  Huadian Energy                  -0.25067    5
  Huaneng Power International     -0.30352    6
  Huadian Power International     -0.54251    9

TABLE VIII. 10 SAMPLE ENTERPRISES' AVERAGE COMPREHENSIVE SCORES RANKING IN 3 YEARS

  Enterprise                      Comprehensive score   Rank
  Shenzhen Energy                  0.874169             1
  Guangdong Power A                0.571365             2
  SDIC Huajing Power               0.399807             3
  SP Power Development             0.369577             4
  Datang Power                    -0.10116              5
  Anhui Power                     -0.15592              6
  Shenneng Stock                  -0.23621              7
  Huadian Energy                  -0.40261              8
  Huaneng Power International     -0.63529              9
  Huadian Power International     -0.68373              10

TABLE IX. THE F1 SCORES RANKING OF SAMPLE ENTERPRISES IN THE SAME YEAR

  Enterprise               Fiscal year   F1          Rank
  Shenzhen Energy          2008           0.92093    1
  SDIC Huajing Power       2008           0.25534    3
  SP Power Development     2008          -0.34221    6
  Shenzhen Energy          2009           2.40115    1
  Guangdong Power A        2009           1.39809    2
  SP Power Development     2009           0.69064    4
  Shenzhen Energy          2010           1.36924    1
  Guangdong Power A        2010           0.60804    2
  SP Power Development     2010           0.38242    4

TABLE X. THE F2 SCORES RANKING OF SAMPLE ENTERPRISES IN THE SAME YEAR

  Enterprise               Fiscal year   F2          Rank
  SDIC Huajing Power       2008          -0.04855    5
  Shenzhen Energy          2008          -0.29529    7
  SP Power Development     2008           0.73822    4
  Shenzhen Energy          2009           0.02595    5
  Guangdong Power A        2009           1.89974    1
  SP Power Development     2009           0.83322    3
  Shenzhen Energy          2010          -0.61854    7
  Guangdong Power A        2010           1.70093    1
  SP Power Development     2010          -0.25192    4

TABLE XI. THE F3 SCORES RANKING OF SAMPLE ENTERPRISES IN THE SAME YEAR

  Enterprise               Fiscal year   F3          Rank
  SDIC Huajing Power       2008           3.20559    1
  Shenzhen Energy          2008           0.68671    4
  SP Power Development     2008           1.6434     2
  Shenzhen Energy          2009           0.62791    2
  Guangdong Power A        2009          -0.613      8
  SP Power Development     2009           0.43809    3
  Shenzhen Energy          2010           0.33958    2
  Guangdong Power A        2010          -0.57178    4
  SP Power Development     2010          -0.31873    3

TABLE XII. THE COMPREHENSIVE SCORES RANKING OF SAMPLE ENTERPRISES IN THE SAME YEAR

  Enterprise               Fiscal year   Comprehensive score   Rank
  SDIC Huajing Power       2008          0.572524              1
  Shenzhen Energy          2008          0.565932              2
  SP Power Development     2008          0.371927              3
  Shenzhen Energy          2009          1.412657              1
  Guangdong Power A        2009          1.071644              2
  SP Power Development     2009          0.669429              3
  Shenzhen Energy          2010          0.643918              1
  Guangdong Power A        2010          0.61512               2
  SP Power Development     2010          0.067373              3

B. Process explanation
1) Inspection (Table I).
The social responsibility index data of the thermal power enterprises are processed by factor analysis with the SPSS 17.0 software. Because the social responsibility indexes have different dimensions, the sample data are first standardized and then submitted to the applicability test. Tested by the KMO measure and Bartlett's test, the data reach the standard required for factor analysis, with a KMO value of 0.704 and a Bartlett significance level of 0.000, which is below 0.05. Therefore the sample data are suitable for factor analysis.
2) Extraction of the common factors (Table II).
The total variance explained table containing the eigenvalues is obtained after extraction of the common factors by the principal component method. Judged by the criterion that the eigenvalue should be greater than 1, the first three principal components describe most of the information of the indexes. In the table, the variance contribution rates of the first three components are 43.385%, 20.473% and 18.470% respectively, and the cumulative variance contribution rates are 43.385%, 63.858% and 82.329%; that is, the social responsibility of the thermal power enterprises can be explained effectively by the three principal components.
3) Rotation (Table III).
So that the obtained common factors can be explained more easily, the principal component matrix is subjected to variance-maximizing (varimax) rotation, which yields the rotated component matrix. The economic meaning of each factor is named after the sizes of the variables' loadings on that common factor. From the statistics given in the table, the loadings of the main factor F1 on the four indexes X1, X2, X3 and X4 are all above 0.9 and are the largest loadings of those indexes. These indexes comprehensively demonstrate the economic responsibility of the enterprises, that is, the responsibility towards shareholders and creditors; they explain the financial performance responsibility of the enterprise and its ability to create value, so F1 is naturally named the economic responsibility factor. On the three indexes X5, X6 and X7 the main factor F2 shows the maximum factor loadings; because these indicators are closely connected with the condition of the internal environment, resources and employees, F2 is named the internal environment responsibility factor. On X8, X9 and X10 the main factor F3 carries the maximum loadings, and it is named the external environment responsibility factor, because these indexes involve society, the pollution of resources outside the enterprise and the outside stakeholders, such as the government and the public.
4) Score (Table IV).
The score of each main factor for each sample enterprise is obtained by the regression method, and the weight of each factor is assigned as the proportion of its variance contribution to the cumulative variance contribution of the three main factors, which naturally gives the comprehensive scoring function:

W = (0.43385 F1 + 0.20473 F2 + 0.18470 F3) / 0.82329

with which the factors are weighted and scored, and the enterprises are ranked on the basis of the comprehensive scores obtained.

As Table IV shows, the factor score model of the samples in this paper is:

F1 = 0.259X1 + 0.257X2 + 0.233X3 + 0.229X4 + 0.108X5 + 0.139X6 + 0.040X7 + 0.088X8 - 0.025X9 - 0.121X10
F2 = 0.103X1 + 0.104X2 + 0.053X3 + 0.045X4 + 0.513X5 + 0.495X6 - 0.306X7 + 0.031X8 - 0.053X9 - 0.086X10
F3 = -0.004X1 + 0.004X2 - 0.038X3 - 0.006X4 - 0.070X5 + 0.146X6 + 0.249X7 + 0.420X8 + 0.401X9 + 0.357X10

From Tables II and IV, the comprehensive scoring model is:

W = (0.43385 F1 + 0.20473 F2 + 0.18470 F3) / 0.82329
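As a quick numerical check of this model (a sketch, not part of the paper's own procedure), substituting the three-year average factor scores of Shenzhen Energy from Tables V, VI and VII reproduces its comprehensive score in Table VIII:

    # Comprehensive score of Shenzhen Energy from its average factor scores
    # (F1, F2, F3 taken from Tables V, VI and VII).
    f1, f2, f3 = 1.563773, -0.29596, 0.5514

    w = (0.43385 * f1 + 0.20473 * f2 + 0.18470 * f3) / 0.82329
    print(round(w, 6))   # about 0.8742, matching the 0.874169 reported in Table VIII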
From the sizes of the coefficients of the factors in this function, when the social responsibility of the sample companies is evaluated comprehensively, F1 (the economic responsibility factor) is the main influencing factor, while F2 (the internal environment responsibility factor) and F3 (the external environment responsibility factor) also play important roles in the comprehensive score. According to each factor score and the comprehensive factor score function, the comprehensive ranks of the 10 thermal power sample enterprises are listed in the tables above, so that the corporate social responsibility performance of the sample companies from 2008 to 2010 can be compared.

5) Result analysis (Tables V-XII).
From the final scores in Table VIII, only four companies have a positive three-year average comprehensive score of social responsibility performance, fewer than 50% of the samples. It can be concluded that the social responsibility performance of the whole thermal power industry was not good enough, with large differences among the samples. The rankings listed in Table XII clearly show that Shenzhen Energy and SP Power Development were ranked in the top three of the comprehensive scores in all three years, and that Guangdong Power A was in the top three in 2009 and 2010. This indicates that the indexes of these three companies reflected a comparatively good social responsibility performance. Because F1 has the strongest explanatory power in this evaluation system and the loading of F1 on X1 (EVA) reaches 0.979, the highest of all, the EVA index contributes greatly to the explanation of social responsibility; the three companies are also ranked at the top on the original data of index X1, which in turn reflects the rationality of an evaluation index system of thermal power enterprise social responsibility that includes EVA. Beyond that, the rankings on the single factors have a great influence on the comprehensive ranking of an enterprise, with a strong positive correlation between them. This is especially true for F1, with which the comprehensive ranking of the enterprises is generally consistent. However, the differences caused by the rankings of F2 and F3 cannot be ignored either, as shown by SDIC Huajing Power, Shenzhen Energy and Huadian Energy in Tables V-VIII. This shows that the social responsibility of a thermal power enterprise is largely displayed in its economic responsibility, in other words in whether or not the enterprise shoulders its responsibility towards shareholders and creditors well. Economic responsibility is the most important component, and it is interpreted by the performance on profitability and debt-paying ability, so focusing on profitability and debt-paying ability is the key point for improving the overall social responsibility performance. Thus, enhancing independent R&D (research and development) on efficient resource utilization and improving the level of technological innovation are essential for thermal power enterprises to create more value, with the support of national policy and increasing investment in R&D.

In addition, the influence of F2 and F3 on the overall performance of corporate social responsibility is a force to be reckoned with. It shows that it is significant for a thermal power enterprise to lay emphasis on every internal aspect of its own management and control, to improve the efficiency of resource utilization and guarantee the staff's healthy growth and development, and also to establish a harmonious and friendly relationship with the external stakeholders, such as the government and society, and with the ecological environment. Specifically, enterprises should observe the relevant codes of conduct and the law, pay tax in accordance with the law, provide internal staff with a healthy and safe working environment, distribute a fair share of pay to employees, offer employees beneficial training, and undertake more contributions to public welfare; on the other hand, they should pursue coordinated development with the environment, abide by the relevant laws and regulations on environmental protection and resource conservation, establish and improve the environmental management system, stick to energy conservation and emission reduction, continuously refine environmental protection and energy-saving work, actively respond to and avoid environmental risks, and gradually realize the clean development of power generation. For electric power enterprises, as important state-owned enterprises, undertaking social responsibility is very important and urgent, both for the development of the economy and society and for their establishment of a modern enterprise system and realization of sustainable development.

V. CONCLUSION

Through the theory of factor analysis and the examples computed with the SPSS software, it is clear that factor analysis deserves to be widely used in the evaluation of corporate social responsibility. The factor analysis method not only effectively resolves the problem of determining the index weights in corporate social responsibility evaluation, but also makes the setting of the weights of the evaluation indexes fast and objective. In the process of social responsibility evaluation, it is acceptable to classify and

simplify the evaluation indexes by factor analysis, extract the common factors, and then rank the objects synthetically according to the scores calculated by the comprehensive score model. Thus the cost of the evaluation process of corporate social responsibility is greatly cut, and the waste of human, financial and material resources is avoided.

In this example, because each factor reflects only certain aspects of the social responsibility performance of the thermal power enterprises, the weight in this paper is assigned as the proportion of the variance contribution of each principal component to the cumulative variance contribution of the three main factors; the mathematical model is W = (0.43385*F1 + 0.20473*F2 + 0.18470*F3)/0.82329. The confirmation of the weighting values is based on the analysis of the original data through SPSS processing and therefore has a certain objectivity, and the method synthesizes many indexes into a few basically uncorrelated comprehensive factors, so the purposes of dimension reduction and of reducing the superposition and redundancy of information among the evaluation indexes are definitely achievable. Through the calculation of the comprehensive factor scores, the analysis of thermal power enterprise social responsibility becomes clear in three aspects, namely the economic responsibility, the internal environment responsibility and the external environment responsibility. And with respect to the specific problems reflected in the scores of the model, the improvement of social responsibility becomes accessible.

This model illustrates a certain rationality of the evaluation system of thermal power enterprise social responsibility built on the ten selected indexes through factor analysis, but because of the limitations of the model itself it cannot integrate and quantify all the indexes that can explain the performance of social responsibility. In addition, the influence of qualitative indexes, beyond the quantitative ones, is also very important, such as the quality of social responsibility management, the credibility of the financial statements, the technical level of pollution control, the establishment of relevant sustainable development policies, employee benefits and the security situation, and so on. Thus, fitting scientifically valued qualitative indexes into the model as far as possible is necessary in future research, so as to consider the performance of thermal power social responsibility fully.

Of course, the practical application of the SPSS factor analysis method requires a profound understanding of the actual problems and of the basic ideas of the method, a deep realization and a reasonable explanation of the output, and corresponding support for decisions, so that the combination of scientific theoretical knowledge and computer hardware and software technology can better and more powerfully support the solution of practical problems.


Hongming Chen is a professor in the School of Economics and Management, Changsha University of Science and Technology, Changsha, P.R. China. He received his BS, MSc and PhD in Information System and Management from Hunan University.
His major research interests include accounting information systems and business administration. His papers have appeared in many journals and international conferences, such as Accounting Research and the International Forum on Computer Science-Technology and Applications (IFCSTA). Professor Chen has written several books, for instance Computer Accounting Theory and Application (Hunan Science and Technology Press) and Visual FoxPro 6.0 Design and Accounting Computerization Model (Hainan Publishing House).
Professor Chen is a member of the Accounting Society of China, and a trustee of the Accounting Society of Hunan Province and the Accounting Society of Changsha City.


Xiaocan Xiao received a bachelor degree of administration from Hunan University in 2010. She is interested in research fields that mix computers or software with company management. She has published a paper in a quality journal.

Relationship between Motivation and Behavior of SNS User

Hui Chen
School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing, China
Email: Chen-hui@vip.sina.com
(Supported by the Chinese University Scientific Fund, BUPT2009RC1022)

Abstract: With 225 SNS network users as the subjects, the research focuses on the influence of SNS network users' motivation on their behavior. Data are collected through questionnaires and the relationship between motivation and behavior is studied with the relevant analytical methods. As shown in the study, information and instrument motivation, entertainment and aesthetic motivation, social connection motivation, altruism motivation, ascription and identification motivation, and intrinsic motivation have significant positive correlations with the usage time, usage frequency and usage level of the users.

Index Terms: SNS user, motivation, behavior

I. INTRODUCTION

With the wide spread of the Internet and the development of Web 2.0, social networking has become one of the most important basic applications for netizens.

At the end of 2009, the CNNIC (China Internet Network Information Center) issued the Survey of Chinese Netizens' Use of Social Networks 2009, which shows that at present there are more than 1000 SNS websites and the number is increasing. By the end of 2009, the number of netizens using SNS had reached 124 million, one third of the total number of netizens in China. The various applications of SNS websites have been affecting people in many aspects, such as habits of using the Internet, living styles, interests and even social values. Take one popular SNS application, Happy Farm, as an example: "stealing vegetables" has been affecting our lives greatly. First, the approach has had a big impact on Chinese preferences, which means a preference for a virtual mode of fun. Second, the concept of "stealing" has challenged Chinese ethics. Third, the people playing this game tend to ignore time and energy, affecting their work or study. Fourth, this social networking game has been a hindrance to family harmony. Last, there are also concerns about the fact that some netizens use SNS websites to achieve goals that are against social ethics, such as a personal attack or a manipulation of the media.

Therefore, the impact of SNS websites on society has been all-round. With the booming of SNS websites, we have been faced with unprecedented challenges. Should the government monitor the business mode or not? How can the SNS websites monitor their own development mode? How should the netizens be regulated and guided? The answers lie in research on the behaviors of SNS website users. Why do they use the SNS websites? What needs are met by the SNS websites? Why do they indulge in the forums, information and games provided by the SNS websites? Research on these problems can help the government evaluate the necessity of monitoring these SNS websites and the methods of regulating them, which would therefore bring a healthy development to the SNS websites. However, research on these questions is still blank.

This research focuses on the relationship between the motivation and behavior of the netizens, through which it discusses what influences motivation has on behavior and how.

II. LITERATURE REVIEW

A. Concept of Social Network
Being one of the most important applications of Web 2.0, SNS websites have been the focus and a hot topic in the Internet industry. SNS (Social Network Service) refers to the Internet applications that provide assistance in building online social networks.
The SNS can be divided into two types according to their standards and functions. Type I: managing the real social network online, such as Facebook, Renren and Kaixin. Type II: connecting people with the same interests, such as Myspace and Douban. The former manages the social network based on real identities, while the latter is more about entertaining and making friends. This paper defines the SNS websites in terms of the first type.

B. Research on virtual community users' motivations
Dholakia et al.'s (2004) research shows that the personal motivations influencing users of SNS websites are purposive value (including information value and instrumental value), self-identification, interpersonal relationship maintenance, social reinforcement and entertainment value. Information value means that the participants acquire and share information in the virtual community and have access to the others' minds.

(Dholakia, 2004) Information value involves acquiring community, which is interests-oriented; on the other hand,
information, learning to do things, providing information, they want to contribute to the community, servicing for
and contributing to the community. Instrumental value is its development. The two factors codetermine the
to solve a problem or to create something new through members behaviors in the community. The members
online communication. (Hars&Ou, 2002; participation is defined by two dimensions: one is the
McKenna&Bargh, 1999) After surveying 280 SNS time the members spent on virtual community, the other
website users, Ohbyung Kwon, Yixing Wen (2009) one is the degree to which the members exchange with
suggests that the components of SNS website users each others actively.
motivations are social identity, altruism, and perceived In terms of participation pattern, Adler and Christopher
identification It points out that altruism include two types: (1998) classify virtual community members into four
kin altruism and reciprocal altruism. In the virtual world, categories: passive person (who wants to gain
kin altruism refers to the fact that the internet users tend entertainment or information without any effort), active
to share the secrets online, form the family-like person (who takes part in activities and discussions
relationship with other users and do things beneficial for initiated by others), inducer (who initiates discussions or
each other. Reciprocal altruism refers to the fact that the arranges activities to attract other members to take part
internet users help the others without sacrificing ones in), and manager (who is a mature inducer as an
own maxims and beliefs, hoping for a similar help in the intermediary between community members and
future. community operators such as a forum moderator ). In
After surveying 178 SNS website users, Chen & Yin terms of two factors, the relationship between members
(2010) holds that SNS users motivations can be and consume activity and that between members and
classified into six types: information and instrumentental virtual community, Kozinets (1999) classify virtual
motivation, recreational and aesthetic motivation, community members into several types: browser,
socializing motivation, altruism motivation, ascription socializer, contributor and intrinsic person. In terms of
and identification motivation, and intrisic motivation. the contribution level of members to the community,
This research adopts this classification pattern to analyze Wang and Fesenmaire (2004) classify the members into
the relationship between SNS users motivation and four categories: browser (who does not have strong social
behavior. connection with other members and seldom make
contributes to the community), socializer (who keeps
C. Relationship between motivation and behavior
certain social connection with the community group and
Motivation is a psychological concept, the cognition sometimes make contributes to the community),
of which has gone through different stages. Motivation is contributor (who keeps strong connection with the
the intrinsic cause for people to take part in certain community and have great enthusiasm to community
activity, that is, an intrinsic force that drives people to activities, and often make contributes to the community),
take actions so as to meet certain needs and reach certain intrinsic person (who keeps very strong social and
goals. personal connection with the community as an very
As for the relationship between motivation and active contributor).
behavior, Taylor and others (Taylor, Sluckin, Davies, In this research, SNS users behavior includes usage
Reason, Thomson & Colman, 1982) propose that time, usage frequency and usage level, and the concept of
psychologists interpret motivation as a process or a usage level cites the views of Wang and Fesnemaier
series of processes that will stimulate, guide, maintain (2004).
and finally ends at certain target-directed behavior
succession. III. RESEARCH TARGET AND RESEARCH HYPOTHESIS
Motivation has a complicated relationship with
behavior in addition to stimulating, guiding and A. Research Target
maintaining behavior. Behaviors of the same kind may
The research studies the relationship between SNS
have different motivations, i.e., different motivations are
manifested by the same kind of behavior, and different virtual community members motivation and their
behaviors may come from the same or similar motivation. behavior, SNS users motivation in this research comes
from the researchers previous research. (Chen, 2010), in
Psychologists often explain the differences in behavior
which the motivation is classified into information and
strength with the concept of motivation and consider
strong behavior as the result of high level motivation. instrumental motivation, recreational and aesthetic
Moreover, the concept of motivation is often used to motivation, social connection motivation, altruism
motivation, ascription and self-identification, and
indicate behaviors persistence. It is suggested that the
intrinsic. The user behavior is divided into time,
higher the motivation level, the longer the behavior will
be even with a relatively low strength. Then, whats the frequency and level.
relationship between SNS users motivation and behavior The operating definitions of every variable are as
below.
in the virtual world?
Information and instrumental motivation: Users
D. Research on Virtual community Members Behavior acquire or share the information in the internet
Wang and Fesenmaire hold that on one hand members community, and get the information they need.
in the community expect to gain certain value from the

Recreational and aesthetic motivation: To get pleasure Hypothesis H1: Members motivation has a positive
and relaxation through surfing the content within the correlation with their behavior.
internet community and interacting with other users. To H1.1: The stronger the members information and
get the aesthetic pleasure through designing the internet instrument motivation, the longer their usage time;
community (such as pictures, videos and flash) H1.2: The stronger the members entertainment and
Social connection motivation: Users communicate with aesthetic motivation, the longer their usage time;
people of the same opinions, avoid feeling lonely, and H1.3: The stronger the members social connection
receive friendship and social support within the virtual motivation, the longer their usage time;
community.. Users get recognition and acceptance from H1.4: The stronger the members altruism motivation,
other users for their contribution, therefore a promotion the longer their usage time;
of social status. H1.5: The stronger the members ascription and
Altruism motivation: It includes kin altruism and identification motivation, the longer their usage time;
reciprocal altruism. In the virtual world, kin altruism H1.6: The stronger the members intrinsic motivation,
refers to the fact that the users tend to share the secrets, the longer their usage time;
form the family-like relationship and do the things information and
beneficial for each other online. Reciprocal altruism instrument
motivation
refers to the fact that help the other users without
sacrificing ones own maxims and beliefs, hoping for a
usage time
similar help in the future. entertainment and
Sense of belonging and personal identification: Users aesthetic
rely on the virtual community, thinking themselves motivation
important to the community. Users sense their conformity
to the group. social connection
usage frequency
Intrinsic: users personal goals are in conformity to the motivation
group goals or are fit in the group goals.
Time: The average time per week of using the SNS altruism
website. motivation
Frequency: The average times per week of using the
SNS website. usage level
ascription and
Level: Based on the previous researches, four levels of
identification
involvement into the SNS websites are defined: motivation
Surfer: Few connections with the other members,
and few contributions to the virtual community. intrinsic
Features: no specific purposes, random surfing, self- motivation
enjoyment or passive information receiver. Figure 1. Research Model
Socialiser: have certain social connection within the
community and occasional contributions to the Hypothesis H2: Members usage has a positive
community Features: occasional attention to the correlation with their usage frequency.
other members, occasional participation of the H2.2: The stronger the members entertainment and
discussions and activities started by others. aesthetic motivation, the higher their usage frequency;
Contributors: strong social connections with the H2.1: The stronger the members information and
virtual community, passionate for the activities of instrument motivation, the higher their usage frequency;
the virtual community, and frequent contributions H2.3: The stronger the members social connection
for the community. Features: passionate attentions motivation, the higher their usage frequency;
to the other members, frequent participations of the H2.4: The stronger the members altruism motivation,
discussions and activities started by others. the longer their usage frequency;
Insider: extremely strong social connection and H2.5: The stronger the members ascription and
personal relation with the community, extremely identification motivation, the higher their usage
active contributor of the community. Features: frequency n, the longer their usage frequency;
besides attention to individuals, also participate in H2.6: The stronger the members intrinsic motivation,
group activities, and be willing to start a discussion the higher their usage frequency;
or organize activities in order to attract the other Hypothesis H3: Members motivation has a positive
members, like an intermediary between the correlation with their usage level.
community members and community operators. H3.1: The stronger the members information and
B. Research Hypothesis instrument motivation, the higher their usage level;
H3.2: The stronger the members entertainment and
Based on the above study, the research hypotheses are
aesthetic motivation, the higher their usage level;
established to probe into what is the relationship between
H3.3: The stronger the members social connection
SNS users motivation and their behavior (usage time,
motivation, the higher their usage level;
usage frequency and usage level), and whether they are
correlated with each other.

H3.4: The stronger the members altruism motivation, which bachelor degree accounts for 43.4%, while master
the longer their usage level; degree accounts for 49.1%; among the subjects, 55.3%
H3.5: The stronger the members ascription and are intrinsic students.
identification motivation, the higher their usage level;
B. Descriptive Statistical Analyze
H3.6: The stronger the members intrinsic motivation,
the higher their usage level;
TABLE I.
DESCRIPTIVE E STATISTICS OF EACH MOTIVATION VARIABLE
IV. RESEARCH METHOD
Independent Item
Min Max Aver SD
Variable s
A. Research Tool information
The research adopts questionnaire as its main research and
3 1.00 6.00 5.63 0.130
material. Questionnaire One focuses on the SNS users instrument
motivation
six motivations, and it measures users motivation with entertainment
the six-point scale of Likert Scale, in which 1 to 6 and aesthetic 4 1.00 6.00 4.72 0.162
represent completely disagree, basically disagree, motivation
partially disagree, partially agree, basically agree, Social
connection 7 1.00 6.00 4.60 0.212
completely agree respectively. For example: In this motivation
community, I can gain some useful information and data. altruism
Questionnaire Two measures SNS users behavior, 4 1.00 6.00 4.32 0.077
motivation
including their usage frequency, usage time and usage ascription and
level. Questionnaire Three surveys SNS users basic identification 6 1.00 6.00 4.03 0.054
motivation
information, including their gender, age, marital status, intrinsic
education level, after-tax monthly income and whether 2 1.00 6.00 3.87 0.014
motivation
they are intrinsic students or not.
B. Research Process Table I shows the results of the descriptive statistical
The questionnaires were delivered from 1st Nov. 2010 analysis of the six motivation variables in the model,
to 15th Dec. 2010 to subjects who are SNS network users including the minimum and the maximum, average and
through site link, email and papery questionnaires. standard deviation.
Altogether 230 questionnaires have been collected, in It can be seen that information and instrument
which 225 are valid. motivation is the strongest one. Intrinsic motivation is the
only one the average of which is below 4, so most people
C. Research Method tend not to have this motivation.
The study adopts data analysis as the research method. Judged from the survey, the SNS network users whose
(1) Descriptive analyze: Make statistic on average, registration time are below 1 year, 1-2 years, 2-3years,
standard deviation and so on of each variable, so as to and more than 3 years account for 8.0%, 22.1%, 32.7%,
describe SNS network users motivation and behavior. and 37.2% respectively, so more than half of the subjects
(2) Reliability analyze: Before a further data analysis, have been using SNS network for more than two years.
reliability analysis should be conducted first of the In this survey, more than half of the members use
variable measures in the research models and hypotheses. social websites less than 9 hours every week, 21.2% 10-
Reliability can measure the reliability, uniformity and 19 hours, 15.5% more than 20 hours. 51.8% members log
stability of the questionnaire. in social websites more than 6 times every week, so the
(3) Correlation analyze: It is a statistical method to usage level is very high and the users even highly depend
research the correlative degree among variables. The on the social websites.
paper focuses on the correlation analysis of users each According to this survey, most of the users are browsers,
motivation and behavior. accounting for 64.2%, 19.5% socializers, and 11.5%
contributors. In this survey, usage level of SNS social
V. RESEARCH RESULTS websites is not high.

A. Sample Composition C. Reliability Analyze


In this survey, there are 225 valid samples, in which In the research, SPSS software is used to test the
male account for 56.2%, while female account for 43.8%; questionnaires reliability. Data analysis shows that
the age mainly ranges from 21 to 30, in which 21-25 Alpha coefficient of the whole questionnaire is 0.942 and
account for 63.7%, while 26-30 account for 27.4%; that of each usage behavior is 0.934 and 0.612
86.7% of the subjects are unmarried; education level respectively, which indicates a high reliability.
mainly covers bachelor degree and master degree, in

H2.1 is not proved: The stronger the members


information and instrument motivation, the higher their
D. Correlation Analyze
usage frequency;
The research adopts correlation analysis to study the H2.6 is not proved: The stronger the members
correlation between motivation and behavior. See Table 4 intrinsic motivation, the higher their usage frequency;
for its results. H2.4 is not proved: The stronger the members
Based on the correlation analysis, the following altruism motivation, the longer their usage time;
hypotheses are proved.
H1.1 is proved: The stronger the members VI. RESEARCH DISCUSSION
information and instrument motivation, the longer their
usage time; It can be judged from the above results that social
H1.2 is proved: The stronger the members network users usage time has positive correlation with
entertainment and aesthetic motivation, the longer their information and instrument motivation, entertainment and
usage time; aesthetic motivation, social connection motivation,
H1.3 is proved: The stronger the members social altruism motivation, ascription and identification
connection motivation, the longer their usage time; motivation, and intrinsic motivation, which indicates that
H1.4 is proved: The stronger the members altruism members must spend much time on SNS websites if they
motivation, the longer their usage time; want to browse or gain information on SNS websites, to
H1.5 is proved: The stronger the members ascription release themselves through entertainment games, to
and identification motivation, the longer their usage time; establish contact with others, or to do favors for others;
H1.6 is proved: The stronger the members intrinsic
motivation, the longer their usage time; TABLE III.
RESULTS OF RELIABILITY ANALYZE
H2.2 is proved: The stronger the members
entertainment and aesthetic motivation, the higher their
usage frequency; Freque Usage
H2.3 is proved: The stronger the members social Motivation and Behavior Time
ncy Level
connection motivation, the higher their usage frequency;
Information Pearson 0.165* 0.378*
H2.5 is proved: The stronger the members ascription and correlation
0.064
*
and identification motivation, the higher their usage Instrument
Sig. 0.013 0.336 0.000
frequency n, the longer their usage time; motivation
H3.1 is proved: The stronger the members Entertainmen Pearson 0.369* 0.189* 0.304*
t and correlation * * *
information and instrument motivation, the higher their Aesthetic
usage level; Sig. 0.000 0.005 0.000
Motivation
Pearson 0.381* 0.546*
Social 0.136*
correlation * *
TABLE II. Connection
RESULTS OF RELIABILITY ANALYZE Motivation significance 0.000 0.041 0.000
Dimensions Cronbachs Alpha Pearson 0.243*
0.115
0.375*
Information and Instrument .642 Altruism correlation * *
Motivation Motivation
Sig. 0.000 0.085 0.000
Entertainment and Aesthetic .570
Motivation Ascription Pearson 0.380* 0.179* 0.512*
Social Motivation .835 and correlation * * *
Ascription and Identification .842 Identification
Sig. 0.000 0.007 0.000
Motivation Motivation
Altruism Motivation .748 Pearson 0.247* 0.348*
0.027
Intrinsic Motivation .877 Intrisic correlation * *
Motivation
Sig. 0.000 0.692 0.000
*. Significant correlation exists in (both sides) of the 0.05 level. **.
H3.2 is proved: The stronger the members Significant correlation exists in (both sides) of the 0.01 level.
entertainment and aesthetic motivation, the higher their
usage level; moreover, only by regularly participation in a certain
H3.3 is proved: The stronger the members social SNS social network can the SNS members feel they
connection motivation, the higher their usage level; belong to the community; similarly, if one confirms his
H3.4 is proved: The stronger the members altruism own contribution, he will be inspired to spend more time
motivation, the higher their usage level; on SNS community.
H3.5 is proved: The stronger the members ascription Moreover, social network members usage frequency
and identification motivation, the higher their usage level; have a positive correlation with the entertainment and
H3.6 is proved: The stronger the members intrinsic aesthetic motivation, which indicates that members must
motivation, the higher their usage level; frequently pay attention to SNS community to take part in
Based on the correlation analysis, the following SNS games, for example, in the stealing vegetables
hypotheses are not proved. game, members have to keep eyes on planting
vegetables and stealing vegetables by frequently logging

Moreover, only by regular participation in a certain SNS social network can members feel that they belong to the community; similarly, if one confirms his own contribution, he will be inspired to spend more time in the SNS community.
Moreover, social network members' usage frequency has a positive correlation with entertainment and aesthetic motivation, which indicates that members must pay frequent attention to the SNS community in order to take part in SNS games; in the vegetable-stealing game, for example, members have to keep an eye on planting and stealing vegetables by frequently logging in to the community. Similarly, if one confirms his own contribution, he will be inspired to spend more time in the SNS community.

Figure 2. Research result: the six motivations (information and instrument, entertainment and aesthetic, social connection, altruism, ascription and identification, intrinsic) are linked to usage time, usage frequency and usage level.

At last, the usage level of social websites has a positive correlation with all six motivations, which indicates that the stronger the motivation, the more intensively members use and attend to the social websites; that is to say, the more they act as browsers, socializers, contributors and intrinsic persons.

VII. RESEARCH APPLICATION

Based on previous studies of online social communities, this research discusses SNS members' motivation and behavior in a specific SNS environment, and it has some significance in practice.

A. Strengthen network regulation

From the above survey, it can be found that SNS users spend much time on SNS websites, with 51.8% of netizens logging in to social websites more than six times every week, which indicates that social websites are highly attractive to netizens. As a double-edged sword, the Internet offers both space for free expression and ways to spread vicious information. Websites' fidelity and the reliability of information dissemination improve with the emergence of SNS websites, but once vicious information is released it will lead to bad consequences, because it will be spread quickly by users who trust the websites. Therefore, it is important to regulate SNS social websites as quickly as possible.

B. Protect and regulate SNS social network members' privacy

Many SNS websites are based on real-name registration, so regulatory agencies should require SNS network operators to discipline themselves. SNS operators control a tremendous amount of substantial and detailed information, which may create hidden dangers to information security. To be legal operators, they are required to protect users' privacy well.

C. Lead public opinion in a correct direction

As a platform for sharing, spreading and gaining information, SNS is characterized by publicity, accessibility and integration. Public opinion generally reflects the public's faiths, attitudes, ideas and emotions towards various phenomena and problems in the real society. Rationality is often mixed with irrationality in public opinion. Given that the network has great influence, bribing and manipulating Internet public opinion has become one form of vicious competition in domestic commerce and other fields. In such cases, SNS users and society will suffer terrible negative effects. Therefore, operators are responsible for leading public opinion in a correct direction.

D. Implications for marketing

First, the research can help SNS operators to develop unique services. SNS operators should recognize that people use SNS websites for different purposes, which change over time. Therefore, they should design the services of SNS websites in consideration of members' different motivations and needs.
Second, the research can help SNS websites to create an e-business model. SNS websites can access each member's profile from the registration information and can find the user's interests, hobbies, experiences, preferences and relationship circle through behavior analysis. Matching users' interests and hobbies exactly with their consuming behavior will help their market decision-making.

VIII. LIMITATION OF STUDY AND FUTURE STUDY

The conclusion of the study was limited by the amount of information and data discovered in the documents, reports and studies comprising the literature review. In addition, the use of a survey alone to collect the data was a limitation, because the data cannot be triangulated. Babbie (2004) describes triangulation as the use of several different research methods to test the same findings; triangulation allows stronger support for the presence of a relationship [15]. However, a similar limitation inhibits the validation of findings in any study or research project.
The second limitation was the sample. The author used only college students as study subjects. If the results of this research are to be generalized, more similar studies with different subjects will be needed.
Another limitation was the use of the Likert scale. As Gill and Johnson (2002) noted, participants may or may not give an accurate assessment of their beliefs, feelings, attitudes or behaviors [16]. Rather, they may answer according to what they feel the correct response should be, not how they really feel, or may respond by always marking the most neutral possible answer. Thus, the data
is legitimate only to the extent that participants are completely honest.
Based on the findings, conclusions and limitations of this study, recommendations are presented in this section.
First, the research methods can be developed: laboratory research can be introduced, because a field study involves factors that are difficult to control, such as the subjects' feelings and environmental factors, whereas in a laboratory those factors can be controlled effectively.
Second, the research subjects can be more widely distributed: more distributed samples can give more generalizable results, which should be much more useful to the development of the theories.

ACKNOWLEDGMENT

The authors wish to thank the Chinese Universities Scientific Fund and Beijing University of Posts and Telecommunications; it was with their financial support that the author could carry out the survey and study more deeply.

REFERENCES

[1] CNNIC. The 26th Statistics Report on Chinese Internet Development Situation. 2010.7.
[2] Richard P. Bagozzi, Utpal M. Dholakia. Intentional social action in virtual communities. Journal of Interactive Marketing, Volume 16, Issue 2, 2002, Pages 2-21.
[3] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[4] Utpal M. Dholakia, Richard P. Bagozzi, Lisa Klein Pearo. A social influence model of consumer participation in network- and small-group-based virtual communities. International Journal of Research in Marketing, Volume 21, Issue 3, September 2004, Pages 241-263.
[5] Hogg M A. The social psychology of group cohesiveness: From attraction to social identification [DB/OL]. NY: NYU Press, 1992.
[6] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[7] Ohbyung Kwon, Yixing Wen. An empirical study of the factors affecting social network service use [DB/OL]. Computers in Human Behavior, 2010, 26: 254-263.
[8] Youcheng Wang, Daniel R. Fesenmaier. Towards understanding members' general participation in and active contribution to an online travel community. Tourism Management, Volume 25, Issue 6, December 2004, Pages 709-722.
[9] Kozinets V. E-tribalized Marketing?: The Strategic Implications of Virtual Communities of Consumption. European Management Journal, 1999, 17(3): 252-264.
[10] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[11] Zeithaml, V.A. Consumer perceptions of price, quality, and value: A means-end model and synthesis of evidence. Journal of Marketing, Vol. 52(1988), p. 2-22.
[12] Zeithaml, V.A., Parasuraman, A., & Malhotra, A. Service quality delivery through web sites: A critical review of extant knowledge. Journal of the Academy of Marketing Science, Vol. 30(2002), p. 362-375.
[13] Parasuraman, A., Zeithaml, V.A., & Malhotra, A. E-S-QUAL: A multiple-item scale for assessing electronic service quality. Journal of Service Research, Vol. 7(2005), p. 213-233.
[14] Ulaga, W. & Eggert, A. Exploring the key dimensions of relationship value and their impact on buyer-supplier relationships. American Marketing Association Conference Proceedings, Vol. 13(2002), p. 411.
[15] McKnight, H.D., Choudhury, V. & Kacmar, C. Developing and Validating Trust Measures for e-Commerce: An Integrative Typology. Information Systems Research, Vol. 13(2002), p. 334-359.
[16] Gefen, D. Reflections on the Dimensions of Trust and Trustworthiness Among Online Consumers. ACM SIGMIS Database, Vol. 33(2002), p. 38-53.
[17] Jarvenpaa, S. L., Tractinsky, N. & Vitale, M. Consumer trust in an Internet store. Information Technology and Management, Vol. 1(2000), p. 45-71.
[18] Chen, L., Gillenson, M. L., & Sherrell, D. L. Enticing online consumers: an extended technology acceptance perspective. Information & Management, Vol. 39(2002), p. 705-719.
[19] Srinivasan, N., & Ratchford, B. T. An empirical test of a model of external search for automobiles. Journal of Consumer Research, Vol. 18(1991), p. 233-242.
[20] Koiso-Kanttila, N. Time, attention, authenticity, and consumer benefits of the Web. Business Horizons, Vol. 48(2005), p. 63-70.
[21] Wang, S. Cue-based trust in an online shopping environment: conceptualization and propositions. In T. A. Suter (Ed.), Marketing Advances in Pedagogy, Process and Philosophy, Proceedings of the Annual Meeting of the Society for Advances in Marketing (2001), 6-10 November, New Orleans, LA.
[22] Crosby, L. A., Evans, K. R. & Cowles, D. Relationship quality in services selling: An interpersonal influence perspective. Journal of Marketing, Vol. 54(1990), p. 68-81.
[23] Gefen, D., Karahanna, E. & Straub, D.W. Trust and TAM in online shopping: an integrated model. MIS Quarterly, Vol. 27(2003), p. 51-90.
[24] Ba, S., & Pavlou, P. A. Evidence of the effect of trust building technology in electronic markets: Price premiums and buyer behavior. MIS Quarterly, Vol. 26(2002), p. 243-268.
[25] Babbie, E. The practice of social research (10th ed.). Belmont, CA: Wadsworth Thomson Publishing (2004).
[26] Gill, J., & Johnson, P. Research methods for managers. London: Sage Publications (2002).
[27] J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp. 68-73.
[28] CNNIC. The 26th Statistics Report on Chinese Internet Development Situation. 2010.7.
[29] Richard P. Bagozzi, Utpal M. Dholakia. Intentional social action in virtual communities. Journal of Interactive Marketing, Volume 16, Issue 2, 2002, Pages 2-21.
[30] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[31] Utpal M. Dholakia, Richard P. Bagozzi, Lisa Klein Pearo. A social influence model of consumer participation in network- and small-group-based virtual communities. International Journal of Research in Marketing, Volume 21, Issue 3, September 2004, Pages 241-263.
[32] Hogg M A. The social psychology of group cohesiveness: From attraction to social identification [DB/OL]. NY: NYU Press, 1992.
[33] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[34] Ohbyung Kwon, Yixing Wen. An empirical study of the factors affecting social network service use [DB/OL]. Computers in Human Behavior, 2010, 26: 254-263.
[35] Youcheng Wang, Daniel R. Fesenmaier. Towards understanding members' general participation in and active contribution to an online travel community. Tourism Management, Volume 25, Issue 6, December 2004, Pages 709-722.
[36] Kozinets V. E-tribalized Marketing?: The Strategic Implications of Virtual Communities of Consumption. European Management Journal, 1999, 17(3): 252-264.
[37] McKenna K Y A, and Bargh J A. Causes and consequences of social interaction on the internet: A conceptual framework [DB/OL]. Media Psychology, 1999, (1): 249-269.
[38] Li, N. and Zhang, P. (2002) Consumer online shopping attitudes and behavior: An assessment of research, Eighth Americas Conference on Information Systems, 2002.
[39] Hsu, M. H., Yen, C. H., Chiu, C. M., and Chang, C. M. (2006) A longitudinal investigation of continued online shopping behavior: An extension of the theory of planned behavior, Int. J. Human-Computer Studies, 64, pp. 889-904.
[40] Stewart, D. W., and Pavlou, P. A. (2002) Substitution and complementarity: Measuring the effectiveness of interactive marketing communications, Journal of the Academy of Marketing Science, 30(4), pp. 376-396.
[41] Pavlou, P. A., and Fygenson, M. (2006) Understanding and predicting electronic commerce adoption: An extension of the theory of planned behavior, MIS Quarterly, 30(1), pp. 115-143.
[42] Jarvenpaa, S. L. and Todd, P. A. (1997) Consumer reactions to electronic shopping on the World Wide Web, International Journal of Electronic Commerce, 2, pp. 59-88.
[43] Zhou, L., Dai, L., and Zhang, D. (2007) Online shopping acceptance model - A critical survey of consumer factors in online shopping, Journal of Electronic Commerce Research, 8(1), pp. 41-62.
[44] Ajzen, I. (1991) The theory of planned behavior, Organizational Behavior and Human Decision Processes, 50, pp. 179-211.
[45] Limayem, M., Khalifa, M., and Frini, A. (2000) What makes consumers buy from the internet? A longitudinal study of online shopping, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 30(4), pp. 421-432.
[46] Cho, V. (2006) A study of the roles of trusts and risks in information-oriented online legal services using an integrated model, Information & Management, 43, pp. 502-520.
[47] Battacherjee, A. (2002) Individual trust in online firms: Scale development and initial test, Journal of Management Information Systems, 19(1), pp. 211-241.
[48] Heijden, H. V., Verhagen, T., and Creemers, T. (2001) Predicting online purchase behavior: Replications and tests of competing models, Proceedings of the 34th Hawaii International Conference on Systems Science, Maui, HI.
[49] Reichheld, F. F., and Schefter, P. (2000) E-loyalty: Your secret weapon on the web, Harvard Business Review, 78(4), pp. 105-113.
[50] Pavlou, P. A., and Gefen, D. (2004) Building effective online marketplaces with institution-based trust, Information Systems Research, 15(1), pp. 37-59.
[51] Gefen, D. (2002) Customer loyalty in e-commerce, Journal of the Association for Information Systems, 3, pp. 27-51.
[52] Luo, X. (2002) Trust production and privacy concerns on the internet: A framework based on relationship marketing and social exchange theory, Industrial Marketing Management, 31(2), pp. 111-118.
[53] Gefen, D. (2000) E-commerce: The role of familiarity and trust, Omega, (28:6), pp. 725-737.
[54] Jarvenpaa, S. L. and Tractinsky, N. (1999) Consumer trust in an internet store: A cross-cultural validation, Journal of Computer-Mediated Communication, 5(2).
[55] Jarvenpaa, S. L., Tractinsky, N., and Vitale, M. (2000) Consumer trust in an internet store, Information Technology and Management, 1(1/2), pp. 45-71.
[56] Pavlou, P. A., and Fygenson, M. (2006) Understanding and predicting electronic commerce adoption: An extension of the theory of planned behavior, MIS Quarterly, 30(1), pp. 115-143.
[57] Pavlou, P. A. (2002) What drives electronic commerce? A theory of planned behavior perspective, Academy of Management Proceedings 2002, OCIS: A1-A6.
[58] Bailey, J. P., and Bakos, J. Y. (1997) Reducing buyer search costs: Implications for electronic marketplaces, Management Science, 43(12), pp. 1676-1692.
[59] Ba, S., and Pavlou, P. A. (2002) Evidence of the effect of trust building technology in electronic markets: Price premiums and buyer behavior, MIS Quarterly, 26(2), pp. 243-268.

Chen Hui was born in Shanxi Province on March 26, 1970. She received a Doctor's degree in Applied Psychology from Beijing Normal University, Beijing, China, in 2004; a Master's degree in Communication Management from Coventry University, Coventry, U.K., in 2006; a Master's degree in Applied Psychology from Beijing Normal University in 2001; and a Bachelor's degree in Fundamental Psychology from Beijing Normal University in 1991.
She is an associate professor in the School of Management and Economics, Beijing University of Posts and Telecommunications, Beijing, China. Her published books include Leadership in China (Beijing: Press of Beijing University of Posts and Telecommunications, 2005) and Communication Consumer Behavior (Beijing: Press of People Communications, 2010). Her current and previous research interests are Industrial and Organizational Psychology, Consumer Behavior, and Human Resource Management.
The Load Forecasting Model Based on Bayes-GRNN
Yanmei Li
School of Business and Administration, North China Electric Power University, Baoding 071003, China
Email: liyanmei28@yahoo.com.cn

Jingmin Wang
School of Business and Administration, North China Electric Power University, Baoding 071003, China
Abstract: Compared with the classical BP neural network, the generalized regression neural network (GRNN) requires no iterative training process, only a smoothing parameter. The model is stable and fast, and the connection weights between neurons do not need to be adjusted during training. This paper establishes the index system of the GRNN forecasting model and then uses Bayes theory to reduce the indexes, which become the input variables of the GRNN model. Simulation on actual data and comparison with the classical BP neural network show that the method achieves higher speed and accuracy.

Index Terms: Bayes, load forecasting, generalized regression neural network

I. INTRODUCTION

With the introduction of the concept of the smart grid, a large number of distributed power sources are connected to the grid, which brings new challenges to distribution network planning, construction and operation. Since many users install distributed energy resources (DER) to provide electricity, it is difficult for distribution network planners to predict load growth accurately, which affects the rationality of the plans. For a long time many electric power operators have committed themselves to the investigation of load forecasting techniques for electric power systems and have obtained a number of results, for example time series, regression analysis, gray theory, artificial neural networks and so on. The artificial neural network has a powerful parallel processing mechanism, the capability of approximating arbitrary functions, learning capability, and self-organization and self-adaptation capability, and it can take into account the impact of variable factors such as weather and temperature, so it has been widely applied in the fields of load forecasting and decision-making. Compared with the classical BP neural network, the generalized regression neural network requires no iterative training process, only a smoothing parameter; the model is stable and fast, and the connection weights between neurons do not need to be adjusted during training. Moreover, the factors affecting load are numerous (historical load, temperature, weather information, date type, week type and so on) and it is difficult to determine in advance which of them matter; because of the real-time constraints of field forecasting, the input features cannot all be selected adaptively on line and are usually chosen in advance based on experience. This paper therefore applies the concepts and methods of Bayes decision theory to the selection of input features for short-term load forecasting.
II. BAYES THEORY

From the Bayesian viewpoint, a decision should make use of all the information that can be obtained: the sample information, and all prior information available before sampling, which comes from experience, intuition and subjective judgment. Such subjective knowledge is also a valuable asset, and it should be formally introduced into statistical inference and decision-making; this is exactly what classical statistics does not consider. Classical statistical inference admits and uses only sample information, and does not recognize or use subjective judgment and intuition. Bayes theory, in contrast, formally brings judgment and intuition into statistical inference and decision analysis, so that prior information and sample information are combined when inferring and making decisions.
Bayes decision theory and its methods have already been widely used in fields such as engineering and management science. The steps of adaptively selecting the input features with the Bayes method are as follows.

A. Determine the prior probability distribution

The prior probability P(θj) is an estimate of the probability distribution of the variable θj; it reflects prior knowledge about the variable, including practical experience and subjective judgment. In load forecasting, the prior distribution is defined over the candidate influencing factors, subject to the constraints that every probability is greater than zero and that the probabilities sum to one. Since the load exhibits the "near-large, far-small" characteristic (the period nearest to the forecasted point affects it most) and the "similar load on the same date type" characteristic, the prior probability distribution is taken as

  P_per = α·P_dis,  P_equ = β·P_div,  P_per = γ·P_equ,
  P_per + P_dis + P_equ + P_div = 1,
  P_per > 0, P_dis > 0, P_equ > 0, P_div > 0.

In the formula, P_per and P_dis represent the prior probabilities of influencing factors that are close to or far from the forecasted point, respectively; P_equ and P_div represent the prior probabilities of influencing factors with the same or a different date type; α, β and γ take values in [1, ∞) and may be fixed according to the concrete conditions. When α = 1, β = 1 and γ = 1, the prior probability distribution is uniform.

B. Determine the likelihood function

The likelihood function is essentially a conditional probability: it reflects the sample information, and its value, P(x1, x2, ..., xN | θj), is called the likelihood rate. During load forecasting, if the input vector θj is selected, we may use N samples x_n (n = 1, ..., N) to calculate the error according to the following formula:

  R_valid_j(f_train) = (1/N_valid) · Σ_{x_j ∈ Z_valid} |y_j − f_train(x_j)| / y_j × 100%,

where R_valid_j(f_train) is the relative mean error, on the validation set, of the regression function f_train obtained from the training set; N_valid is the number of samples in the validation set Z_valid; y_j is the actual load value; and f_train(x_j) is the forecast load value. The likelihood E_i = P(x1, x2, ..., xN | θj) is then obtained from this validation error.

C. Calculate the posterior probability

The posterior probability is given by Bayes' formula,

  P(θj | x1, x2, ..., xN) ∝ P(x1, x2, ..., xN | θj) · P(θj).

Here the prior probability and the likelihood function act on the input vectors, and the posterior probability corresponds to the output vector. Because the posterior probability synthesizes prior knowledge and sample information, it is ultimately the criterion for deciding which group of influencing factors to choose as the input vector: the larger the posterior probability, the larger the probability that the corresponding input vector is selected.

D. Choose the input vector θj

According to the actual characteristics of the load in a given area, choose the M factors with the larger influence on the load as the set of candidate influencing factors; then randomly choose different combinations of influencing factors to form the input vector θj.
Based on Bayes theory, the logic block diagram of adaptive input-variable selection is shown in Figure 1.

Figure 1. Logic diagram of Bayes theory (select M influencing factors as candidates; determine the prior probability of each factor from the prior distribution formula; randomly select a group of factors as an input feature vector; calculate the validation error and the likelihood E_i; calculate the posterior probability; repeat until every influencing factor has been considered; compare the posterior probabilities and select the group with the largest one as the input vector X; forecast).
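As a concrete illustration of steps A to D, the following minimal Python sketch runs the selection loop end to end. It assumes the caller supplies the candidate factor subsets, their prior probabilities and a routine that trains a GRNN on a subset and returns its validation error; the mapping from validation error to likelihood used here (a decreasing exponential) is one simple choice, not necessarily the one used by the authors, and all names are illustrative.

import numpy as np

def select_input_vector(candidate_subsets, priors, train_and_validate):
    # candidate_subsets: list of tuples of factor indices (the theta_j)
    # priors:            prior probability of each subset (> 0, summing to 1)
    # train_and_validate: callable returning the relative mean error (%) of a
    #                     GRNN trained with that subset, measured on Z_valid
    errors = np.array([train_and_validate(s) for s in candidate_subsets])
    likelihood = np.exp(-errors / 100.0)        # smaller error -> larger likelihood
    posterior = likelihood * np.asarray(priors, dtype=float)
    posterior /= posterior.sum()                # Bayes rule, normalised
    best = int(np.argmax(posterior))            # subset with the largest posterior
    return candidate_subsets[best], posterior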
III. GRNN NEURAL NETWORK

GRNN is a new type of neural network proposed by Donald Specht at the Lockheed Palo Alto Research Laboratory. It is built on the basis of mathematical statistics, approximates the underlying relations from the sample data, and is mainly used for system modeling and forecasting. It differs from traditional neural networks in that it has a clear theoretical basis: it is a neural network established on the foundation of mathematical statistics.
Its advantage is fast learning. The network finally converges to the optimal regression surface for the clustered samples. Once the learning samples are determined, the corresponding network structure and the connection weights between neurons are also determined; the training process is actually the process of determining the smoothing parameter. Even when sample data are scarce, the results are good, and the network can handle uncertain data. The generalized regression neural network is characterized by few manually adjusted parameters: only the parameter σ needs to be adjusted. The learning of the network relies entirely on the data samples and is fast, and this property minimizes the subjective impact of modeling assumptions on the predicted results. It is an ideal means and tool for surface fitting and modeling.
As shown in Figure 2, the GRNN structure has four layers of neurons: the input layer, the pattern layer, the summation layer and the output layer. The corresponding network input is X = (X1, X2, ..., Xm)^T and the output is Y = (Y1, Y2, ..., Yl)^T.
The number of neurons in the input layer is equal to the input dimension m of the training samples. These neurons are simple distribution units that pass the input variables directly to the hidden layer.
The number of neurons in the pattern layer is equal to the number of training samples n, each neuron corresponding to a different sample. The transfer function of neuron i in the pattern layer is

  P_i = exp[ −(X − X_i)^T (X − X_i) / (2σ²) ],  i = 1, 2, ..., n,

where X is the network input, X_i is the learning sample corresponding to neuron i, and σ is the smoothing parameter; that is, the output of neuron i is the exponential of the squared Euclidean distance D_i² = (X − X_i)^T (X − X_i) between the input variable X and the corresponding sample X_i.
The summation layer consists of two types of neurons. One type computes the arithmetic sum of all pattern-layer outputs, with connection weights of 1 between the pattern-layer neurons and this neuron; its transfer function is

  S_D = Σ_{i=1}^{n} P_i.

The other type computes a weighted sum of the pattern-layer outputs: the connection weight between neuron i in the pattern layer and neuron j in the summation layer is the j-th element Y_ij of the output Y_i of the i-th sample. The transfer function of these summation neurons is

  S_Nj = Σ_{i=1}^{n} Y_ij·P_i,  j = 1, 2, ..., l.

The number of neurons in the output layer is equal to the dimension l of the output vector of the learning samples, and each output neuron divides the corresponding weighted sum by S_D, that is,

  y_j = S_Nj / S_D,  j = 1, 2, ..., l.

Figure 2. Structure of GRNN.

Generalized regression neural network theory is based on nonlinear (kernel) regression analysis. Let the joint probability density function of the random vector x and the random variable y be f(x, y). For the value x0 of x, the regression value of y on x0 is

  ŷ(x0) = ∫ y·f(x0, y) dy / ∫ f(x0, y) dy.   (1)

Using Parzen nonparametric estimation, the density function is estimated from the sample data set {x_i, y_i}, i = 1, ..., n, as in Eq. (2):

  f(x0, y) ≈ [1 / (n·(2π)^((p+1)/2)·σ^(p+1))] · Σ_{i=1}^{n} exp(−d(x0, x_i))·exp(−d(y, y_i)),   (2)

where

  d(x0, x_i) = Σ_{j=1}^{p} [(x0j − x_ij)/σ_j]²,  d(y, y_i) = (y − y_i)²,

n is the sample size, p is the dimension of x, and σ is the width coefficient of the Gaussian function, called the smoothing parameter here.


e d x0 , xi ye d y , yi dy

i 1

iG 1is defined in Eq.(5), wherein, 1j, k, lN, and i, j,
y x0 n
k, l differ from each other, F will control the degree of
e d x0 , xi e d y , yi dy

i 1

variation of differential item
bG iG kG lG
(3)
against
iG .

ze x dz 0
y

As , calculation result two integrals of


iG 1 iG F ( bG iG kG lG ) (5)
Eq.(3)
(F) Hybridization operation: according to Eq.(6),
d x0 , xi
ye i create a hybrid between i G 1
and this generation

y x0
individuals to generate the next generation individuals,
G
d x0 , xi
n

e
wherein, lj is the j-th gene of the i-th individual of the
G-th generation, C is the hybridization rate, random
i 1 number originates from [0, 1] uniformly.
(4)

y x0 in Eq.(4) ijG Random number C


It can be seen that the prediction value G 1
G 1
ij
ij
is equal to weighted sum of the value of dependent others
y
variables i of all samples. (6)
(G) Selecting operation: according to Eq.(7), filial
A Smoothing parameter optimization generation individuals compete with the parent to choose
In the training process, learning algorithm of the next generation individuals.
Generalized regression neural network adjust the
smoothing parameter instead of adjusting the iG 1 P( iG 1 ) P( iG )
connection weights between neurons, to adjust the G 1
G
i
i
transfer function of each unit in model layer so as to get others
(7)
the best regression estimation.
(1) Estimate smoothing parameter with ordinary (H) G+1G.
differential evolution algorithm (DE) (I) If G exceeds the maximum evolving algebra M, or
Parameter estimation is to solve the value of when if the best fitness value difference between the G-th
2


d x0 , xi x0 j xij / w j
p generation and the G+k-th generation is not greater than
esp., go to step (J); otherwise return to Step (C). Where k
j 1
or is non-negative integer, which can be set by the user
8
d y, yi y yi is minimum value. according to accuracy requirement, esp. = 10
2
.
G
(J) Take individual bwith the best fitness value as
DE method steps: firstly determine the domain of
as , and width range of the i-th component as
hi , parameter estimation value in the last generation
population.
take as individual,
q as fitness function, then
perform the following steps:

B Estimate the smoothing parameter with the
(A) Select population size N, Weighting factor F= [0, modified differential evolution algorithm (MDE)
2], maximum evolving algebra M and hybridization rate a) Maintain the diversity of species
C [0, 1]. Inbreeding will lead to degradation, its easy to make
(B) Generate initial population 0 :{
i0 (i=1
an individual approach to a local optimum if the
evolution shrinking too fast, and resulting in inbreeding
2, ,N)}, set evolving algebra G=0.
of the next generation. For maintaining the diversity of
P( iG ) species, this paper reserves the best individual and
(C) Calculate the fitness of each individual
initializes another individual, which is called resetting
G
and the best individual b of G-generation. operation, and designs the parameters to measure
distribution range of offspring, as Eq.(8) shows.
(D) Perform step (E) to step (G) with i (i=1, 2, ,
G
N) to generate the G+1-th generation population.
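A minimal differential-evolution loop for the smoothing parameter, following steps (A) to (J) above, can be sketched as follows. The fitness is assumed to be a user-supplied function mapping a candidate smoothing parameter to its validation error (smaller is better); since the parameter is a scalar here, the per-gene crossover of Eq. (6) reduces to keeping or replacing that single gene. All names and default values are illustrative, not taken from the paper.

import numpy as np

def de_estimate_sigma(fitness, bounds=(0.01, 2.0), N=20, F=0.6, C=0.9, M=100, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, N)                     # (B) initial population
    fit = np.array([fitness(s) for s in pop])        # (C) fitness of each individual
    for _ in range(M):                               # generation loop
        best = pop[fit.argmin()]                     # current best individual
        for i in range(N):
            k, l = rng.choice([j for j in range(N) if j != i], 2, replace=False)
            v = pop[i] + F * (best - pop[i] + pop[k] - pop[l])   # (E) mutation, Eq. (5)
            v = float(np.clip(v, lo, hi))
            trial = v if rng.random() <= C else pop[i]           # (F) crossover, Eq. (6)
            f_trial = fitness(trial)
            if f_trial <= fit[i]:                                # (G) selection, Eq. (7)
                pop[i], fit[i] = trial, f_trial
    return pop[fit.argmin()]                         # (J) best individual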
B. Estimating the smoothing parameter with the modified differential evolution algorithm (MDE)

a) Maintaining the diversity of the population.
Inbreeding leads to degradation: if the population shrinks too fast, individuals easily approach a local optimum, which results in inbreeding of the next generation. To maintain the diversity of the population, this paper retains the best individual and re-initializes the other individuals (the resetting operation), and designs a parameter to measure the distribution range of the offspring, as Eq. (8) shows:

  δ^G = Σ_j |φ_ij^G − φ̄_i^G| / (h_i·N),  where  φ̄_i^G = (1/N)·Σ_j φ_ij^G.   (8)

When δ^G is less than the lower limit δ_min, the resetting operation is called: the best individual φ_b^G is retained, and a number of individuals are generated around it according to a normal distribution.
b) Designing an optimization operation.
The optimization operation is introduced to make use of the evolutionary information and to apply deterministic optimization in a timely way according to the trend of the fitness function. The simplex method is a good optimization operation: it has excellent search ability, requires no derivative calculation and is easy to implement. The simplex optimization operation optimizes one individual with the simplex method and is called with a variable frequency (known as the optimization rate): the optimization rate is reduced when the population is shrinking fast and increased otherwise. Therefore, the distribution ranges of two successive generations are compared and the optimization rate is changed according to Eq. (9), where s1 > s2 > 0 are parameters:

  P_x ← P_x + 0.05·(1 − P_x)  if δ^G/δ^(G−1) ≥ s1;
  P_x ← P_x − 0.05·P_x       if s2 ≤ δ^G/δ^(G−1) < s1;
  P_x unchanged otherwise.   (9)

c) Steps of MDE.
Set the optimization target as the fitness function P(·). The first six steps of MDE are basically the same as those of DE, except that at step (A) the lower limit δ_min and the initial value P_x of the optimization rate must also be selected, the loop of step (D) runs from step (E) to step (H), and step (G) is as follows.
(G) Simplex optimization operation: draw a random number r uniformly from [0, 1]; if r > P_x, go to step (H); otherwise take φ_i^G and two randomly selected individuals φ_j^G and φ_k^G to form the initial simplex, optimize with the simplex method to obtain the best individual φ_ib^G, and replace φ_i^G with it. The simplex is designed on a two-dimensional subspace of the domain to ensure that it is convex.
(H) Selection operation: the same as step (G) of DE.
(I) G + 1 → G.
(J) Resetting operation: calculate δ^G according to Eq. (8); if δ^G < δ_min, implement the resetting operation.
(K) Change the optimization rate according to Eq. (9).
(L) If G exceeds the maximum number of generations M, or if the difference between the best fitness values of the G-th generation and the (G+k)-th generation is not greater than eps, go to step (M); k, eps and the other settings are the same as in DE.
(M) Take the best individual φ_b^G in the last-generation population as the parameter estimate.
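Two of the MDE ingredients, the spread measure of Eq. (8) that triggers resetting and the optimization-rate update of Eq. (9) that decides how often the simplex step is called, can be sketched as below. This is only an illustration: the thresholds delta_min, s1 and s2 are user-chosen design parameters, and the simplex step is delegated to a generic Nelder-Mead optimizer rather than re-implemented.

import numpy as np
from scipy.optimize import minimize

def spread(pop, h):
    # Population spread in the spirit of Eq. (8): mean absolute deviation scaled by the width h.
    return float(np.sum(np.abs(pop - pop.mean())) / (h * len(pop)))

def update_opt_rate(p_x, delta_now, delta_prev, s1=0.9, s2=0.5):
    # Eq. (9): raise p_x when the population is shrinking slowly, lower it when shrinking fast.
    ratio = delta_now / max(delta_prev, 1e-12)
    if ratio >= s1:
        return p_x + 0.05 * (1.0 - p_x)
    if ratio >= s2:
        return p_x - 0.05 * p_x
    return p_x

def maybe_simplex(sigma, fitness, p_x, rng):
    # MDE step (G): with probability p_x, polish one individual with Nelder-Mead.
    if rng.random() < p_x:
        res = minimize(lambda s: fitness(float(s[0])), x0=[sigma], method="Nelder-Mead")
        return float(res.x[0])
    return sigma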
IV. CASE ANALYSIS

In this paper, data of the Southern Hebei Network in 2009 are analyzed. The whole-point active load from September 10 to September 19, 2009 is used as the study sample, the whole-point load of September 20 as the test sample, and the load of September 21 is forecast. Fifty-two condition attributes were selected by experience, of which 12 are load data, namely the whole-point load from the 10th to the 19th; the remaining 40 are non-load data, including weather, date type, sunshine duration, maximum temperature, minimum temperature, average temperature, maximum humidity, minimum humidity and so on. The date type includes rest days (weekends and statutory holidays), and the weather conditions provided by the meteorological information are divided into 17 types. For the neural network input vector, the candidate influencing factors were reduced with the method described above; Table I lists the reduction results. It can be seen that after attribute reduction the input vector is simplified.

TABLE I. RESULTS AFTER THE BAYES REDUCTION METHOD

Table II compares the methods in two respects, forecasting accuracy and training time. Forecasting accuracy is measured by the average relative error ε, defined as follows:

  ε = (1/n)·Σ |P_Ai − P_Fi| / P_Fi × 100%,

where P_Ai is the forecast value, P_Fi is the actual value, and n is the number of forecasts.
It can be seen from Table I that, by using the Bayes reduction, the number of input variables is reduced from 43 to 31, the computing time is reduced to 13 min, and the average relative error decreases from 11.11% to 5.82%; the attributes are reduced to the greatest degree, while the computing time is relatively the shortest and the calculation error the smallest. In terms of training time and accuracy, the method of this paper is better than the traditional BP neural network, so it is suitable for the forecasting samples in this area.
The values of the input and output variables are normalized to [0, 1]. Normalization, one of many possible preprocessing methods, is performed here by the following formula:

  x' = (x − x_min) / (x_max − x_min).
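Both of these quantities are straightforward to compute. The short helper below is a sketch of the min-max normalization and of the average relative error used as the accuracy index; in practice the minima and maxima would be taken from the training data only, and the example numbers are made up.

import numpy as np

def minmax_normalize(x, x_min, x_max):
    # x' = (x - x_min) / (x_max - x_min), mapping values into [0, 1].
    return (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)

def mean_relative_error(forecast, actual):
    # Average relative error in percent between forecast and actual load values.
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(forecast - actual) / actual) * 100.0)

print(mean_relative_error([102.0, 98.0, 105.0], [100.0, 100.0, 100.0]))  # 3.0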
Figure 3. Comparison of actual data and forecast data of the daily maximum load (FMAX, MW).

TABLE II. COMPARISON RESULTS OF DIFFERENT METHODS

CONCLUSION

This paper has presented a short-term load forecasting method based on Bayes theory. The test results on the factor data indicate that the method comprehensively takes into account the factors connected with the load, reaches higher forecast accuracy within a shorter training time, and is an effective short-term load forecasting method.

ACKNOWLEDGMENT

The research was supported by the Fundamental Research Funds for the Central Universities, Fund No. 11MR62.

REFERENCES

[1] GU Zhihong, NIU Dongxiao, WANG Huiqing. Research on application of general regression neural network in short-term load forecasting. Electric Power, vol. 39, no. 4, pp. 11-14, 2006.
[2] PAN Rui, LIU Jun-yong, NI Ya-qi. Ultra-short term forecasting for short circuit current based on load forecasting and general regression neural network. Power System Protection and Control, vol. 38, no. 18, pp. 94-99, 2010.
[3] WU Yaohua. Monthly Load Forecasting Based on General Regression Neural Network. Journal of Northwest Hydroelectric Power, vol. 23, no. 4, pp. 9-12, 2007.
[4] LI Ruqi, YANG Licheng, MO Shixun. Short term electric load forecasting based on accumulated weather and ACA-GRNN. Relay, vol. 36, no. 4, pp. 58-62, 2008.
[5] Lin Yushu, Li Jian, Zhao Cuocai. Application of Rough Generalized Regression Neural Network in Heating Load Forecasting. Electric Technology, no. 12, pp. 1-19, 2007.
[6] Chen Yaowu, Wang Leyu, Long Hongyu. Short-term load forecasting with modular neural networks. Proceedings of the CSEE, vol. 21, no. 4, pp. 79-82, 2001.
[7] Pan Feng, Cheng Haozhong, Yang Jingfei, et al. Power system short-term load forecasting based on SVM. Power System Technology, vol. 28, no. 21, pp. 39-42, 2004.
[8] WU Yaohua. A new method of monthly load forecasting. Science & Technology Information, no. 30, pp. 71-72, 2007.
[9] WU Yaohua. Study on Mid- and Long-term Power Load Forecasting Based on Wavelet Soft-threshold Technology. Guangdong Electric Power, vol. 20, no. 12, pp. 5-8, 2007.
[10] LIU Xueqin, WU Yaohua, CUI Baohua. Application of wavelet soft-threshold de-noising and GRNN in monthly load forecasting. Power System Protection and Control, vol. 37, no. 14, pp. 59-62, 2009.
[11] GUO Bin, MENG Ling-qi, DU Yong, MA Sheng-biao. Thickness prediction of medium plate mill based on GRNN neural network. Journal of Central South University: Science and Technology, vol. 42, pp. 960-965, 2011.
[12] ZHOU Hao, ZHENG Ligang, FAN Jianren, CEN Kefa. Application of general regression neural network in prediction of coal ash fusion temperature. Journal of Zhejiang University (Engineering Science), vol. 38, no. 11, pp. 1479-1482, 2004.
[13] WU Yao-hua. Long term load forecasting based on GM-GRNN in power system. Relay, vol. 35, no. 6, pp. 45-48, 53, 2007.
[14] YU Jian-ming, LI Meng, SHU Fei. Application of GRNN-algorithm on Load Modeling of Power System. Proceedings of the CSU-EPSA, vol. 21, no. 1, pp. 104-107, 2009.
[15] LU Ning, ZHOU Jian-zhong, HE Yao-yao. Particle swarm optimization-based neural network model for short-term load forecasting. Relay, no. 12, pp. 65-68, 2010.
[16] LIU Xue-qin, WU Yao-hua, CUI Bao-hua. Short-term load forecasting model based on extended rough set. Relay, no. 5, pp. 25-28, 38, 2010.
[17] Osman, Z.H., Awad, M.L., Mahmoud, T.K. Neural network based approach for short-term load forecasting. Power Systems Conference and Exposition, pp. 1-8, 2009.
[18] Senjyu, T., Takara, H., Uezato, K., Funabashi, T. One-hour-ahead load forecasting using neural network. Power Systems, vol. 17, pp. 113-118, 2002.
[19] Wei-Min Li, Jian-Wei Liu, Jia-Jin Le, Xiang-Rong Wang. The financial time series forecasting based on proposed ARMA-GRNN model. Machine Learning and Cybernetics, vol. 4, pp. 2005-2009, 2005.
[20] SUN Da-shuai, MA Li-xin, WANG Shou-zheng. Design of short-term load forecast systems based on the theory of complex systems. Journal of University of Shanghai for Science and Technology, 2011, 32(1): 39-43.
[21] LIAO Li, XIN Jian-hua, ZHAI Hai-qing, WEI Zhen-hua. Short Term Load Forecasting Model and the Influencing Factors. Journal of Shanghai Jiaotong University, 2004, 38(9): 1544-1547.
[22] ZHU Ji-ping, DAI Jun. Optimization Selection of Correlative Factors for Long-term Load Prediction of Electric Power. Computer Simulation, 2008, 25(5): 226-229.
[23] WEI Lingyun, WU Jie, LIU Yongqiang. Long term electric load forecasting based on system dynamics. Automation of Electric Power Systems, 2000, 24(16): 44-47.
[24] XIAO Xian-yong, GE Jia, HE De-sheng. Combination Method of Mid-Long Term Load Forecasting Based on Support Vector Machine. Proceedings of the CSU-EPSA, 2008, 20(1): 84-88.
[25] LI Ru-qi, CHU Jin-sheng, XIE Lin-feng, WANG Zong-yao. Application of IAFSA-RBF Neural Network to Short-Term Load Forecasting. Proceedings of the CSU-EPSA, 2011, 23(2): 142-146.
[26] ZHOU Xiao-hua, et al. Application of Elman Neural Network to Short-Term Load Forecasting in Rural Power System. Journal of Anhui Agricultural Sciences, 2011, 39(4): 2424-2426.
[27] WANG Lin-chuan, BAI Bo, YU Feng-zhen, YUAN Ming-zhe. Short-term load forecasting based on weighted least squares support vector machine within the Bayesian evidence framework. Relay, 2011, 39(7): 44-49.
[28] ZHOU Ying, YIN Bang-de, REN Ling, BIAN Xue-fen. Study of Electricity Short-term Load Forecast Based on BP Neural Network. Electrical Measurement & Instrumentation, 2011, 48(2): 68-71.
[29] SHI Hai-bo. Power Load Forecasting Based on Principal Component Analysis and Support Vector Machine. Computer Simulation, 2010, 10: 279-282.
[30] LI Xiao-bo, LUO Mei, FENG Kai. Comparison of neural network methods for short-term load forecasting. Relay, 2007, 35(6): 49-53.
[31] LI Ru-qi, YANG Li-cheng, MO Shi-xun, SU Yuan-yuan, TANG Zhuo-zhen. Short-term electric load forecasting based on accumulated weather and ACA-GRNN. Relay, 2008, 36(4): 58-62.
[32] CHEN Chen, et al. Agricultural Product Quality Mining based on Bayesian Classification. Journal of Anhui Agricultural Sciences, 2011, 39(12): 7448-7449.
[33] SHEN Jin-biao, LV Yue-jin. A reasoning and diagnosis model based on rough sets and bayesian networks. 2009, 34(6): 815-818.
[34] CAI Na, ZHANG Xue-feng. Attribution Reduction of Bayesian Rough Set Model. Computer Engineering, 2007, 33(24): 46-48.
[35] HUANG Chong-zhen, LIANG Jing-guo. Predicting construction quality of a marine drilling platform based on GRNN. Journal of Harbin Engineering University, 2009, 30(3): 339-343.
[36] ZHOU Shaolong, ZHOU Feng. GRNN model for prediction of port cargo throughput based on time series. Journal of Shanghai Maritime University, 2011, 32(1): 70-73.
[37] WEN Ai-hua, LI Song. Forecast of Railway Freight Volumes Based on Generalized Regression Neural Network. Railway Transport and Economy, 2011, 33(2): 88-91.
[38] YU Ping-fu, LU Yu-ming, WEI Li-ping, LONG Wen-qing, SU Xiao-bo. Application of General Regression Neural Network (GGRNN) on Predicting Yield of Cassava. Southwest China Journal of Agricultural Sciences, 2009, 22(6): 1709-1713.
[39] NIU Dong-xiao, QI Jian-xun. The Research of the Load Forecasting Method of the Variable Structure Neural Network Based on the Fuzzy Treatment. Operations Research and Management Science, 2001, 10(2): 86-92.
[40] LI De-zhi, HE Yong-xiu, ZHANG Yu. Urban load forecast model based on land use change and load characteristics. Electric Power, 2011, 44(1): 6-10.
[41] LIN Hui, LIU Jing, HAO Zhi-feng, ZHU Feng-feng, WU Guang-chao. Short-term load forecasting for holidays based on the similar days load modification. Relay, 2010, 7: 47-51.
[42] ZHANG Hong-xu, YAO Jian-gang, CAO Wei, MA Gui-xia. Ultra-short Term Load Forecasting Based on Improved Gray Model,
Proceedings of the CSU-EPSA, 2009, 21(6): 74-77.
[43] WANG Hong-wei, LIN Jian-liang. Sales Prediction Based on Improved GRNN. Computer Engineering & Science, no. 1, pp. 153-155, 2010.
Yanmei Li, Lecturer and Master, works in the School of Business and Administration, North China Electric Power University, Baoding, China. Her research interests include modeling and model applications.

Jingmin Wang, Professor, works in the School of Business and Administration, North China Electric Power University, Baoding, China, with research interests in modeling and model applications.
Research on the Model Consumption Behavior and Social Networks Role of Digital Music

Dan Liu
School of Management and Economics, Beijing University of Posts and Telecommunications, Beijing 100876, China
Email: liudan@bupt.edu.cn

Tianchi Yang and Liang Tan
International School, Beijing University of Posts and Telecommunications, Beijing 02209, China
Email: {ee08b277, ee08b018}@bupt.edu.cn
Abstract: In order to better understand the psychosocial factors involved in the consumption of online music, we examine the role that two types of social networks, the advising network and the emotional network, play in individuals' consumption of digital music, especially online music, in China. Using survey data from 154 college students in 3 universities, we find that a person who is more dependent on a social network tends to download music from the internet, and that the consultation network has more influence on downloading music from the internet than the friendship network, along with some other interesting results. We conclude the paper by discussing the theoretical implications of social network research for members' adaptation to digital music and by outlining specific implications for practice.

Index Terms: online music, download behavior, social networks

I. INTRODUCTION

Digital music refers to music that is produced, disseminated and stored using digital technologies. Digital music products include songs, MVs, flash animations and so on.
According to the transmission channel, digital music is divided into two parts: online music and wireless music (also called mobile music). Online music mainly refers to digital music in audio formats such as MP3 or WMA that is listened to online or downloaded to playback equipment; it corresponds to the network market. Wireless music mainly refers to ringtones, CRBT (coloring ring back tone), IVR bells and whole songs downloaded to the phone (including WAP/MMS), which are provided via the mobile communication network and terminals; it corresponds to the mobile value-added market.

A. Online music

Online music refers to digital music that is listened to online or downloaded directly to computers and other playback equipment.
The superiority of online music is obvious. People can select what they enjoy and combine tracks as they like. Online music has taken a large share of the market from traditional retail stores. In line with long-tail theory, people no longer pay for the 70% of goods on traditional shelves that are useless to them; they pay only for what they enjoy and select it from the internet. What is more, the internet provides a much wider selection range than traditional shelves: whether mass-market or niche products, people can easily find them on the internet.
Up to now, more than 500 online music providers offer online music services in at least 40 countries. One online music store can accommodate four million singles, while the largest traditional music store can only provide 15 million singles; the capacity of online music stores is incomparable to that of traditional ones.
The development of online music stores is rapid. Looking at the revenue of the global digital music market, digital music accounted for only 8% of the total music market in 2006, but by 2010 it was almost 30%.
The IFPI digital music report (2007) points out the five most important characteristics affecting online music consumption:
(1) Finding the music one enjoys (search or recommendation)
(2) Diversity of selection (content, classified search)
(3) Music price
(4) Getting free music
(5) Audition before buying

B. Mobile music

Mobile music is a music service delivered over the mobile internet, including ringtone downloading, CRBT (coloring ring back tone), IVR bells, whole-song downloading and, in the future, wireless audition.
The development of mobile music is rapid, and it has become the most attractive segment of the entertainment economy. According to an iResearch analysis report, in 2009 the scale of the China mobile music market reached 30.87 billion Yuan, an increase of 9.8% over 2008. With the improvement of environmental factors such as market scale and the 3G commercialization process, the China mobile music market will keep growing steadily; it is expected that its scale will reach 39.28 million Yuan.
From 2002 to 2008, the China mobile music industry experienced three stages: beginning,
development and maturity. In 2002, China Mobile launched a ringtone downloading service, and the market entered the beginning stage. In 2003, with the launch of CRBT, more and more customers used ringtones, CRBT, speech-on-demand and other mobile music services. Since 2005, music phones have become mainstream products. After 2008, with the commercial promotion of 3G and the popularization of music phones, single-track downloading and listening based on the 3G network have become popular.
As a fashionable value-added business, the user groups of CRBT contain a large number of music lovers. CRBT helps customers gain a deep understanding of the digital music business and accumulates potential users for it. With the commercialization of the 3G network, China Mobile, based on its extensive user groups and customers' understanding of the digital music business, is extending to music downloading and other types of mobile music business.

C. Online social networks

The scale-free properties of online social network topology provide digital music with an ideal environment in which to diffuse. Through a long period of research on the algorithms of online social network structure, academics have obtained some significant conclusions, and these mathematical results may throw light upon the mystery behind the diffusion of online music as well as mobile music. Initially, a uniform attachment probability over all random graphs was assigned to a static scale-free network of N nodes with degree k, as shown in Eq. (1):

  N(k) = e^(−k).   (1)

With the development of online social networks, models have evolved from the static scale-free network to more dynamic ones. Eq. (2) shows the evolved BA model that relates the age a of a node to its degree k:

  N_a(k) = (a/N)·(1 − a/N)^(k−1).   (2)

D. The basic attributes of Chinese users' behavior

Distribution of gender: males account for a large percentage of online customers. Data collected by iResearch show that the proportion of male users is over 60%.
Distribution of age: customers aged 25 to 30 account for 42.6%; customers aged 18 to 24 account for 23.1%; customers aged 31 to 35 account for 18.6%. More than 60% of customers are aged 18 to 30, and the proportion exceeds 80% for ages 18 to 35.
Distribution of education: more than half of the users hold a bachelor's degree; 31.1% of users graduated from junior college; the percentages of students in senior high school, people who hold a master's degree or PhD, and students in junior school or below that education level are 8.2%, 7.7% and 3.0% respectively.
And there are some other phenomena. The search engine is the first choice for net citizens in China when they search for music on the Internet, and online music users attach much importance to the accuracy of search engine results. When searching for music, users also pay much attention to the advertisements on the host page of the website. The search engine is still the main way for users to acquire the music they want: about 85.8% of Chinese net citizens acquire music by using a search engine; the proportion who log in to professional music websites is 48.2%; and the proportion reaches 27.1% for Chinese portal websites such as Sina and Sohu.

II. LITERATURE REVIEW

Social influence occurs when a person adapts his or her behavior, attitudes, or beliefs to the behavior, attitudes, or beliefs of others in the social system [5]. Social influence has been the subject of more than 70 marketing studies since the 1960s. Overall, scholarly research on social and communication networks, opinion leadership, source credibility, and diffusion of innovations has long demonstrated that consumers influence other consumers [6]. Influence does not necessarily require face-to-face interaction but rather is based on information about other people [7]. In an online community, information is passed among individual users in the form of digital content. Here, we consider a particular type of social influence that takes place in an online community, namely when members change their site usage in response to changes in the behavior of other members.
Though a relatively new area in marketing research, online communities have attracted the attention of many scholars. Dholakia, Bagozzi, and Pearo [8] study two key group-level determinants of virtual community participation (group norms and social identity) and test the proposed model using a survey-based study across several virtual communities. Kozinets [9] develops a new approach to collecting and interpreting data obtained from consumers' discussions in online forums. Godes and Mayzlin [10] and Chevalier and Mayzlin [11] examine the effect of online word-of-mouth communications. Dellarocas [12] analyzes how the strategic manipulation of Internet opinion forums affects the payoffs to consumers and firms in markets of vertically differentiated experience goods. Narayan and Yang [13] study a popular online provider of comparison-shopping services, Epinions.com, and model the formation of relationships of trust that consumers develop with other consumers whose online product reviews they consistently find to be valuable. Finally, Stephen and Toubia [14] examine a large online social commerce marketplace and study the economic value implications of link formation among sellers.
A new perspective on adoption may be necessary to fully capture the nature of technology acceptance in social computing situations, where the technology is embraced rather than simply accepted by the user, and
where the action made possible by technology is seen as a behavior embedded in society. A technology that was originally intended to deliver subscriber information was adopted by end users as a vehicle for social behavior. As the Internet, networking and communications technologies become increasingly embraced by individuals and embedded in everyday lives and activities, technologically enabled social structures are emerging that are changing the way individuals interact and communicate, and are facilitating fundamental changes to business practices.
From the perspective of economic behavior, economic activity must be carried out within society, and human relationships affect economic action. Economic behavior has both a rational side and a non-rational side. The economic action of individuals is influenced by the trust and emotional factors of human relationships. For instance, our consumption behavior is often affected by advertisements spoken by a superstar; such consumption is not rational but is subject to the influence of a leader. The more people buy a product, the more people will join, like a rolling snowball that becomes bigger and bigger. In reality, access to information is not completely equal for each person: the spread of information is affected by social relations and by the structure of the social network. The utility of each person is not isolated, and personal relations can be affected at any time. An individual's position in the social structure affects his information and his access to resources.
We mainly use Granovetter's theory of the strength of weak ties. Simply speaking, weak ties are relationships that are distant from us. Weak ties do not share much information, and the differences between the pieces of information they carry are large. For instance, when I buy a computer I can obtain more information about the cost of memory and the quality of a computer from a stranger in the computer market than from my classmate, because I would otherwise never encounter this kind of useful information. People with weak ties can collect more information because of the wide range of their social networks.
A person who is good at exchanging resources can acquire resources from the organization, and the opportunities are greater when the relationships point to valuable resources. A person seeking a job has more opportunities when his relationships bridge two different groups, and only a person with weak ties has the opportunity to become a bridge between different groups. Strong ties are the opposite of weak ties: for a person with strong ties, information is shared within a small range and is repeated.
For example, a couple may talk to each other every day but basically about the same things, namely what they did that day, so much of the information is repeated. Even so, strong ties cannot be ignored. Chinese society is a society of human relationships, and the person is more important than the information. Strong ties bring us a sense of trust. Social exchange is unlike economic exchange such as the trade of merchandise and money: it often takes a very long time to find the right occasion to pay back a favor one owes, and without a sense of trust we cannot engage in such human exchange. Weak ties provide the channel for information, and strong ties provide the foundation of trust. In this paper, we use the analysis of weak and strong ties to study the impact of social networks on online music downloads.

III. RESEARCH DESIGN

A. Hypothesis Development

In view of the complexity of social network theory and the limits of our own knowledge, we start from social network theory in connection with the downloading of online music and verify hypotheses in order to discuss the impact of social networks on online music downloads. Through an empirical study built on these assumptions, the theory is verified in practical applications; at the same time, the impact of social networks on online music downloading can be correctly understood.
Given the diversity and complexity of social network theory, this article focuses on the social characteristics of users who download online music and combines social network theory with the hypotheses, so the hypotheses proposed are also related to the characteristics of the user's social network. We discuss three such characteristics: dependence on the social network, centrality in the advice network, and centrality in the emotional network.
We surveyed people with a strong dependence on the social network and people with well-developed advice and emotional networks who download online music, so that social network theory can be applied relatively simply to the actual case of online music downloads.
In order to study the social network characteristics of users, we should first study their attitudes towards the social network. A person's dependence on social networks directly affects his access to resources: the more dependent he is on a strong social network for resources, the wider the resources are likely to be. So we arrive at the assumption that "the stronger a person's dependence on the social network, the greater the likelihood of downloading online music." Weak ties mainly transmit the resources of information and knowledge, while strong ties transmit the resources of trust and influence. We can therefore treat the advice network as consisting of weak ties and the emotional network as consisting of strong ties. According to the "strength of weak ties" and "strength of strong ties" hypotheses, we arrive at assumption two, that "the larger a person's counseling needs in the social network, the greater the possibility of downloading online music," and assumption three, that "the greater a person's emotional needs, the greater the likelihood of downloading online music."

Since there is no previous research on this question, we propose the following three hypotheses based on social network theory:

H1: The stronger a person's dependence on the social network, the greater the likelihood of downloading online music.
H2: The greater a person's counseling needs in the social network, the greater the possibility of downloading online music.
H3: The greater a person's emotional needs, the greater the likelihood of downloading online music.

B. Research Model

Based on the above, this paper places the social network factors within an intention model of online music downloading, shown in Fig. 1. Research on social networks usually requires a causal model so that the theory can be applied to a practical problem, and such models take different forms. To study the impact of the social network on online music downloading we build a simple causal model; it is the basis for measuring the independent and dependent variables.

Following the three assumptions above, we select three dimensions of the social network related to online music: dependence, consulting needs, and emotional needs. We therefore obtain three independent variables: dependence on the social network, the consulting network, and the emotional network. These three independent variables are measured through a questionnaire; the specific design method and basis of the questionnaire are discussed next.

Figure 1. Research model

C. Measurement Scale

First, in view of convenience and of the characteristics of online music users, we chose college students as the survey population. There is no existing questionnaire in the academic literature about attitudes toward online music downloading; the questionnaires that do exist were designed for commercial purposes and are not applicable to this study. The Likert rating scale is the most common format for opinion or attitude surveys, so we adopt a five-point Likert scale in the design of the questionnaire. The questionnaire consists of three parts: the sample definition, the measurement of the independent and dependent variables, and the population (demographic) variables. The sample definition part contains four questions covering Internet usage and online music usage, the measurement part covers the dependent variable and the three independent variables, and the demographic part asks for basic personal information. Each measurement part has about ten questions, and the demographic questions are designed according to the characteristics of the respondents; here we design four such questions.

The first part of the questionnaire is the sample definition. In this paper the sample-definition questions are based on existing questions of established authority and validity. The first question determines the average daily time spent online; it comes from the questionnaire in Luo Jiade's paper "Whether the virtual relationship reflects the real-life relationship" [15]. The second question investigates whether the respondent downloads online music, which defines the research topic. The next two questions are designed according to the actual situation of college students' use of online music.

The second part of the questionnaire is the core: the measurement of the independent and dependent variables. Both are measured with a five-point scale ranging from strongly agree to strongly disagree. Four questions are designed for the dependent variable, with content adapted from a computer-attitude questionnaire [16], for example "I have been using online music" and "I would like to introduce others to download online music." For the independent variables, the questions are designed separately for each of the three variables. The questions measuring reliance on social networks are also adapted from a computer-attitude questionnaire [17]; the purpose is to obtain survey results about attitudes toward, and reliance on, social networks, for example "I think that classmates and friends help me a lot in my learning and life," "I like to communicate and chat with other people," and nine other questions.

Krackhardt divides the people in an organization's social network into three networks: the emotional network, the consulting (advice) network, and the intelligence network. His study of personal attitudes indicated that the emotional network is the most influential factor in the formation of attitudes, that the consulting network has influence on some issues, and that the intelligence network is seldom mentioned [18]. Therefore, we consider only the emotional and consulting networks in this study. On this basis we added, modified, and completed the questions measuring the emotional needs and the consultation requirements of the social network. A typical emotional-needs question is "I am willing to share feelings and privacy with friends" (nine questions of this kind), and a typical advisory question is "I often ask my classmates about current events" (ten questions). This part of the design primarily tests the hypotheses and is therefore the focus of this study.

The third part of the questionnaire measures the population variables, namely basic personal information. Many questionnaires of this kind exist; because our survey is aimed at college students, the questions are designed for their actual situation. In addition to gender, we design four questions such as the monthly cost of living and the annual spending on audio and video products. This part of the design surveys the characteristics of the respondents and also supports descriptive statistics and the conclusions drawn from them.

IV. DATA ANALYSIS

A. Description of the Sample

The questionnaire was administered in paper-and-pencil form. Respondents are undergraduates who have experience of downloading music from the Internet. 200 questionnaires were sent out and 161 were returned; after eliminating 7 invalid ones, we obtained 154 usable questionnaires (an effective rate of 77%). The usable questionnaires were entered into the computer and processed with the SPSS statistical software. The basic characteristics of the valid sample are shown in Table 1.

Table 1. Sample characteristics statistics

In addition, SPSS shows that the average value of the intention to download music from the Internet is 3.9464, which means this intention is salient among undergraduates and that downloading music from the Internet is widespread in college students' lives. The average intention is 4.25 for men and 3.6935 for women, so we can conclude that men are more interested in downloading music from the Internet than women.

B. Reliability and Validity Analysis

This research uses content-consistency reliability, which assesses the reliability of each group of variables through Cronbach's coefficient alpha. The value of alpha ranges from negative infinity to +1; the closer the coefficient is to +1, the higher the reliability. If the coefficient is too low it should be adjusted, because a low value means that the items do not measure the same concept. Table 2 presents the reliability analysis of the research questionnaire.

From the result of the reliability analysis (see Table 2), the questionnaire on social networks and the intention to download Internet music, which consists of 32 questions, has a high coefficient of internal consistency (0.9382), and deleting any single question would not meaningfully improve the effectiveness of the questionnaire. The scale therefore has high reliability.

In this paper the social network variable includes the consultation network and the emotional network. Their items are submitted to factor analysis as independent variables to verify the structural validity of the questionnaire. The first component's eigenvalue is 9.401 and the second is 3.920; only these two eigenvalues are greater than 1, and together they account for 70.109% of the total variance. In other words, the first two factors explain 70.109% of the variation of the original 19 variables, so the 19 questions concentrate on two factors. After determining the two-dimensional structure of the scale, the factors are rotated to obtain the factor rotation matrix; the result shows that the questions loading on the two factors are the same as in the original questionnaire design.
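As an illustration only (not part of the original study), the reliability and validity figures reported above can be reproduced from raw item responses with a short Python sketch. The file name "music_survey.csv", the item column names, and the column ordering are hypothetical; the numbers in the comments are the values reported in the paper.

import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical data: 154 respondents x 32 five-point Likert items.
items = pd.read_csv("music_survey.csv").filter(like="item")
print(cronbach_alpha(items))                       # ~0.94 reported in the paper

# Structural validity: eigenvalues of the correlation matrix of the 19
# consultation/emotional-network items (Kaiser criterion: keep eigenvalues > 1).
net_items = items.iloc[:, :19]                     # assumed column ordering
eig = np.sort(np.linalg.eigvalsh(np.corrcoef(net_items, rowvar=False)))[::-1]
print(eig[:2], eig[:2].sum() / eig.sum())          # ~9.40, ~3.92 and ~70% of variance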

Table 2. Reliability analysis (n = 154)

C. Regression Analysis

a. Regression analysis and hypothesis testing

To verify the hypotheses of the model, regression analysis is carried out on the related factors after confirming that the measuring instruments have adequate reliability and validity.

The model coefficients for the intention to download music from the Internet and dependence on the social network are calculated with SPSS. The regression constant is 1.486 and the regression coefficient of the independent variable (dependence on the social network) is 0.626, which gives the regression equation: intention of downloading music from the Internet = 0.626 × dependence on social network + 1.486.

Table 3. Model coefficients

The significance level of the regression coefficient is 0.0000, so the null hypothesis of the t test is rejected; the linear relationship between the dependent and independent variables is significant and a linear model can be built. For H1 (the more dependent a person is on the social network, the more he tends to download music from the Internet), there is a positive correlation between dependence on the social network and downloading music from the Internet, with a regression coefficient of 0.626. H1 is therefore supported, and we conclude that a person's dependence on the social network influences whether he downloads music from the Internet.

With the result for assumption one in hand, assumptions two and three are verified in the same way. Table 3 is the model-coefficient table in which the dependent variable is the intention to download music from the Internet and the independent variable is the consultation network. From the table, the regression constant is 1.655 and the regression coefficient of the consultation network is 0.628, giving the regression equation: intention of downloading music from the Internet = 0.628 × consultation network + 1.655.

The significance level of the regression coefficient is again 0.0000, so the null hypothesis of the t test is rejected and a linear model can be built. For H2 (the greater a person's consultation needs in the social network, the more he tends to download music from the Internet), there is a positive correlation between the consultation network and downloading music from the Internet. H2 is therefore supported, and we conclude that a person's needs toward the consultation network influence whether he downloads music from the Internet.

Finally, assumption three is verified. Table 4 is the model-coefficient table in which the dependent variable is the intention to download music from the Internet and the independent variable is the friendship network. From Table 4, the regression constant is 2.717 and the regression coefficient of the friendship network is 0.330, giving the regression equation: intention of downloading music from the Internet = 0.330 × friendship network + 2.717.

Table 4. Model coefficients

The significance level of the regression coefficient is 0.0000, so the null hypothesis of the t test is rejected and a linear model can be built. For H3 (the greater a person's emotional needs in the social network, the more he tends to download music from the Internet), there is a positive correlation between the friendship network and downloading music from the Internet, with a regression coefficient of 0.330. H3 is therefore supported, and we conclude that a person's needs toward the friendship network influence whether he downloads music from the Internet.

To sum up the data analysis, the three hypotheses H1, H2 and H3 all hold.
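A minimal sketch of the simple regressions described above, assuming a hypothetical data file with "intention", "dependence", "consultation" and "friendship" columns. The paper's own analysis was done in SPSS, so this is only an equivalent illustration:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("music_survey.csv")          # hypothetical file and column names
y = df["intention"]                           # intention to download online music
X = sm.add_constant(df["dependence"])         # dependence on the social network

fit = sm.OLS(y, X).fit()
print(fit.params)     # constant ~1.486 and slope ~0.626 in the paper
print(fit.pvalues)    # significance of the coefficient (reported as 0.0000)

Replacing "dependence" by "consultation" or "friendship" reproduces the other two single-predictor models in the same way.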

b. Comparative analysis

According to the above conclusions, the consultation network and the friendship network both correlate positively with downloading music from the Internet. As two branches of the social network, they can be combined in a multiple linear regression analysis; the results are shown in Tables 5 and 6.

Table 5. Analysis of variance
Predictors: (Constant), friendship network
Predictors: (Constant), friendship network, consultation network
Dependent variable: intention of downloading music from the Internet

From Table 5, the significance probability of the F statistic is 0.0000, which means the linear correlation between the dependent variable and the independent variables is significant and a linear model can be built. Table 6 is the model-coefficient table in which the same dependent variable is regressed on the two independent variables together.

Table 6. Model coefficients
Dependent variable: intention of downloading music from the Internet

From Table 6 we obtain two regression equations; the second one is: intention of downloading music from the Internet = 1.576 + 0.065 × friendship network + 0.583 × consultation network. We mainly consider the value of Beta (the standardized coefficient) here, and it shows that the consultation network has more influence on downloading music from the Internet than the friendship network.

The structural model based on the data thus shows that the three hypotheses hold. In addition, the consultation network and the friendship network influence downloading music from the Internet to different degrees, and men are more inclined to it than women.
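The comparative analysis can be illustrated with the same kind of sketch; again the data file and column names are hypothetical, and the standardized (Beta) step mirrors the comparison the paper relies on:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("music_survey.csv")                       # hypothetical file/columns
X = sm.add_constant(df[["friendship", "consultation"]])
fit = sm.OLS(df["intention"], X).fit()
print(fit.params)      # ~1.576, 0.065 and 0.583 in the paper

# Standardized (Beta) coefficients, on which the comparison is based.
z = (df[["intention", "friendship", "consultation"]]
     .apply(lambda c: (c - c.mean()) / c.std(ddof=1)))
beta = sm.OLS(z["intention"], z[["friendship", "consultation"]]).fit()
print(beta.params)     # consultation network clearly dominates friendship network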

V. DISCUSSION AND CONCLUSIONS

This paper studies how the social network influences downloading music from the Internet and which kinds of network influence it. From this research perspective, the verification study and data analysis yield several conclusions.

First, the more dependent a person is on the social network, the more he tends to download music from the Internet. People's communication depends on the social network, and interpersonal relationships influence our activities. A person who depends more on the social network can reach a wider range of network resources and tends to download music from the Internet. We can regard people who are highly dependent on the social network as active users with frequent social interactions; for businesses, it is therefore more effective to promote and spread online music downloading among these people.

Second, the greater a person's consultation needs in the social network, the more he tends to download music from the Internet. The consultation network belongs to the weak ties that transfer information, knowledge and other resources over a wide social web and gather a large amount of information. A person with greater consultation needs pays attention to weak ties, and weak ties act as an information channel whose content directly influences the intention to download music from the Internet. Thus, a person who often chats with others or consults information online is likely to download music from the Internet. This conclusion helps merchants understand the characteristics of users.

Third, the greater a person's emotional needs in the social network, the more he tends to download music from the Internet. The friendship network belongs to the strong ties that transfer trust, influence and other resources. Human relationships are very important in Chinese society, and strong ties bring a sense of trust; this feeling greatly influences the choices of people with large emotional needs. The data analysis shows that emotional needs correlate positively with the intention to download music from the Internet; in other words, a person with greater emotional needs in the social network tends to download music from the Internet.

Fourth, the consultation network has more influence on downloading music from the Internet than the friendship network. In the regression analysis with the same dependent variable (intention of downloading music from the Internet) and two independent variables (consultation network and friendship network), the regression coefficient of the consultation network is much larger than that of the friendship network. Strong ties share a narrow range of information with much repetition; although both networks correlate positively with downloading music from the Internet, the degree differs a lot. Merchants should therefore pay more attention to spreading online music through weak ties.

Fifth, men tend to download music from the Internet more than women. Downloading music from the Internet, as a new form of music appreciation, has developed rapidly with social networks. Contemporary undergraduates accept new things more readily than other groups in society, and men adopt new things faster than women. From this conclusion merchants can see the characteristics of the users who download music from the Internet.

Overall, this research offers some new insights, but the following limitations exist. First, the randomness of the questionnaire is limited: it was distributed only to undergraduates of Beijing University of Posts and Telecommunications. Second, many factors influence downloading music from the Internet; because of the research topic, we discuss only the social network and give no demonstration of the other factors. Finally, we designed 19 questions to assess the consultation network and the friendship network, but there is no established way to integrate them into an effective guideline; this is mainly a theoretical problem, and at present the most effective strategy is to compensate for the shortage of theory through extensive empirical studies and to construct rigorous scales as in psychology.

This research discusses the influence of the social network on downloading music from the Internet from a preliminary perspective; it summarizes social network theory and analyzes the influence along two dimensions. We also identify further topics that can be researched in depth: subsequent researchers can expand the pool of respondents and analyze the influence as a whole, or analyze it from new perspectives such as social capital theory and structural holes. Because social network theory is complicated, this research analyzes it mainly from two perspectives, the strength-of-weak-ties and strength-of-strong-ties hypotheses.

ACKNOWLEDGMENT

This project is supported by the National Social Science Foundation of China (Grant No. 11BGL041) and the China Fundamental Research Funds for the Central Universities (Grant No. BUPT2011RC1005).

REFERENCES

[1] See the website of the Fraunhofer Institute for Integrated Circuits (IIS), http://www.iis.fraunhofer.de/EN/bf/amm/index.jsp, and the history of MP3 development at http://www.iis.fraunhofer.de/EN/bf/amm/mp3history/mp3history01.jsp
[2] Selma Borovac, Joanna Golata, Tobias Müller-Prothmann, Edda Behnken, "Integration of Customer Knowledge for the Generation of Service Innovation in the Music Industry," working paper (2010).
[3] Selma Borovac, Joanna Golata, Tobias Müller-Prothmann, Edda Behnken, "Integration of Customer Knowledge for the Generation of Service Innovation in the Music Industry," working paper (2010).
[4] Fueglistaller, U., "From Service Management towards Service Competence - An Entrepreneurial Approach," in Spath, D. and Fähnrich, K. (eds.), Advances in Services Innovations, Springer, pp. 114-127 (2007).
[5] Leenders, Roger Th.A.J., "Modeling Social Influence Through Network Autocorrelation: Constructing the Weight Matrix," Social Networks, 24 (1), 21-48 (2002).
[6] Phelps, Joseph E., Regina Lewis, Lynne Mobilio, David Perry, and Niranjan Raman, "Viral Marketing or Electronic Word-of-Mouth Advertising: Examining Consumer Responses to Pass-Along Email," Journal of Advertising Research, 44 (4), 333-348 (2004).
[7] Robins, Garry, Philippa Pattison, and Peter Elliott, "Network Models for Social Influence Processes," Psychometrika, 66 (2), 161-190 (2001).
[8] Dholakia, Utpal M., Richard P. Bagozzi, and Lisa Klein Pearo, "A Social Influence Model of Consumer Participation in Network- and Small-Group-Based Virtual Communities," International Journal of Research in Marketing, 21 (3), 241-263 (2004).
[9] Kozinets, Robert V., "The Field Behind the Screen: Using Netnography for Marketing Research in Online Communities," Journal of Marketing Research, 39 (February), 61-72 (2002).
[10] Godes, David and Dina Mayzlin, "Using Online Conversations to Study Word-of-Mouth Communication," Marketing Science, 23 (4), 545-560 (2004).
[11] Chevalier, Judith A. and Dina Mayzlin, "The Effect of Word of Mouth on Sales: Online Book Reviews," Journal of Marketing Research, 43 (August), 345-354 (2006).
[12] Dellarocas, Chrysanthos N., "Strategic Manipulation of Internet Opinion Forums: Implications for Consumers and Firms," working paper, Robert H. Smith School of Business, University of Maryland (2005).
[13] Narayan, Vishal and Sha Yang, "Trust between Consumers in Online Communities: Modeling the Formation of Dyadic Relationships," working paper, Stern School of Business, New York University (2006).
[14] Stephen, Andrew T. and Olivier Toubia, "Deriving Value from Social Commerce Networks," Journal of Marketing Research, 47 (April), 215-228 (2010).
[15] Luo Jiade, Social Network Analysis (2nd edition), China Social Science Publishing House (2009).
[16] Luo Jiade, Social Network Analysis (2nd edition), China Social Science Publishing House (2009).
[17] Luo Jiade, Social Network Analysis (2nd edition), China Social Science Publishing House (2009).
[18] Krackhardt, David, and Daniel Brass, "Intra-Organizational Networks: The Micro Side," in Stanley Wasserman and Joseph Galaskiewicz (eds.), Advances in the Social and Behavioral Sciences from Social Network Analysis, Beverly Hills: Sage, pp. 209-230.

Liu Dan received her Ph.D. degree in Management in 2006 at Renmin University of China. She is currently an associate professor at the School of Economics and Management, Beijing University of Posts and Telecommunications. Her major research interests include technological innovation management and strategic management.

Yang Tianchi majors in Telecommunication Engineering and Management. He is an undergraduate student at the International School, Beijing University of Posts and Telecommunications.

Tan Liang majors in Telecommunication Engineering and Management. He is an undergraduate student at the International School, Beijing University of Posts and Telecommunications.

A New Method of Medical Image Retrieval for Computer-Aided Diagnosis

Hui Liu
Dept. of Computer Science & Technology, Shandong University of Finance and Economics, Jinan, China
Digital Media Technology Key Lab of Shandong Province, Jinan, China
Email: liuh_lh@126.com

Guochao Sun
Central Hospital of Binzhou, Binzhou, China
Email: sunguochaobz@163.com

doi:10.4304/jsw.7.6.1289-1295

Abstract—In the field of computer-aided diagnosis, image retrieval is an important approach. Whatever retrieval technology is used, modeling spatial context (e.g., autocorrelation) is a key challenge in the image classification and retrieval problems that arise over image regions. This work proposes a new approach to the retrieval of medical images that starts from the traditional Markov Random Field model. In contrast with previous work, the method copes with the ambiguity of spatial relative-position concepts: a new definition of the geometric relationship between two objects is proposed in a fuzzy set framework. Furthermore, Fuzzy Attributed Relational Graphs (FARGs) are used in this framework, where each node represents an image object and each edge represents the relationship between two objects. The generalization performance of this approach is then compared with alternative models over the IRMA dataset. These experiments show that our method outperforms traditional models such as MRF, FGM and SVM in terms of several standard measures.

Index Terms—spatial context, spatial relative position, fuzzy set, Fuzzy Attributed Relational Graphs (FARGs)

I. INTRODUCTION

As is well known, an enormous mass of digital image data is stored in large archives, e.g. medical radiographs, publishing companies, news agencies, and also our home desktop computers [1]. In the medical image research domain, for example, an electronic multimedia patient record may help to find similar cases; especially when original medical DICOM (Digital Imaging and Communications in Medicine) [2] files are used for processing, this can aid diagnosis and treatment.

All kinds of retrieval systems are therefore necessary in order to find useful data again [3]. A previous study showed that people who describe images often use position descriptions like "on the left side" or "below object x" [4]. This is because what is depicted in an image is highly subjective, whereas spatial information is mainly objective.

There are two major notions for incorporating spatial geometric dependency into classification/prediction models: Markov random field (MRF) models [5, 6] and spatial context, which refers to spatial autocorrelation in the image processing community. Over the last decade, several researchers [7, 8] have exploited spatial context in classification using MRFs to obtain higher accuracies than their counterparts (i.e., non-contextual classifiers). However, relative-position concepts are rather ambiguous: they defy precise definition, yet human beings have a rather intuitive and common way of understanding and interpreting them [9], and any all-or-nothing definition leads to unsatisfactory results in many situations, even of moderate complexity. Relative-position concepts may therefore find a better formulation in the framework of fuzzy sets, as fuzzy relationships. Earlier methods represented such a fuzzy set by an angle θ on the objects, where the angle θ(a,b) is measured between the segment joining two points a and b and the x-axis of the coordinate frame [10]. Other methods use projections of the regions on the coordinate axes and reason about spatial relations using either dominance relations [11] or fuzzy logic [12]. More recent methods include approaches based on neural networks [13], mathematical morphology [9], and gravitational force models [14].

In this paper, a new fuzzy set framework for medical image retrieval is proposed. In addition to the position and scale of the object in spatial geometric relationships, we also consider the orientation, which can help future image retrieval systems to better evaluate the relative position and orientation of objects in an image. Furthermore, we carried out a large number of experiments on medical images, which illustrate the effectiveness of this method.

II. FUZZY APPROACH FOR SPATIAL CONTEXT

Several previous studies [6, 7] have shown that modeling spatial geometric dependency (often called context) during image processing can improve overall classification accuracy. Spatial geometric context can be defined by the relationships between spatially adjacent objects in a small neighborhood.
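As a small illustration of "spatially adjacent objects in a small neighborhood" (formalized as the neighborhood system N_i in the next section), the following sketch computes, for objects reduced to centroid coordinates, the set of neighbors within a squared-distance threshold r. The centroids and the threshold are invented for the example and are not taken from the paper.

import numpy as np

def neighbors_within(centroids: np.ndarray, r: float) -> list:
    # N_i = { j | dist(object_j, object_i)^2 <= r, j != i } for every object i
    d2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return [set(map(int, np.flatnonzero(d2[i] <= r))) - {i} for i in range(len(centroids))]

# Hypothetical object centroids extracted from a segmented radiograph.
pts = np.array([[10.0, 12.0], [11.0, 13.0], [40.0, 5.0]])
print(neighbors_within(pts, r=9.0))   # objects 0 and 1 are mutual neighbors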

A set of random variables whose spatial interdependency relationship is represented by an undirected graph (i.e., a contiguity matrix) is called a Markov Random Field (MRF) [5]. The Markov property specifies that a variable depends only on its neighbors and is independent of all other variables.

The essential idea is to specify the pairs of locations that influence each other, along with the relative intensity of the interaction. The sites in S (where S denotes the spatial framework) are related to one another via a neighborhood system. A neighborhood system for S is defined as

N = \{ N_i \mid i \in S \},

where N_i is the set of sites neighboring i. The neighboring relationship has the following properties: (1) a site is not a neighbor of itself, i \notin N_i; (2) the neighboring relationship is mutual, i \in N_j \Leftrightarrow j \in N_i. For a regular lattice S, the neighboring set of i is defined as the set of nearby sites within a radius r:

N_i = \{ j \in S \mid [\mathrm{dist}(\mathrm{object}_j, \mathrm{object}_i)]^2 \le r,\ j \ne i \},

where dist(A, B) denotes the Euclidean distance between A and B. Note that sites at or near the boundaries have fewer neighbors.

Figure 1. Examples where the relative position of objects with respect to the reference object is difficult to define in an all-or-nothing manner: (a) object B is to the right of A, but it can also be considered to be to some extent above A; (b) object B is strongly above A, and also partly to the left and to the right of A.

The applications anticipated from this work are related to structural pattern recognition, where we are not interested only in the dominating relationships between objects: an object may satisfy several relationships with respect to the other components of the image (see, e.g., Figure 1), and it is clear that the shape of the considered objects has to play an important role in assessing their relative position; any all-or-nothing definition is difficult to reconcile with the actual spatial context of images, even of moderate complexity.

Based on the neighborhood system S described in the MRF, a direction can be defined by a pair of angles α = (α1, α2) in the 3D Euclidean space, where

\alpha_1 \in [0, 2\pi] \quad \text{and} \quad \alpha_2 \in [-\pi/2, \pi/2].

The direction in which the relative position of an object with respect to another one is evaluated is then given by Eq.(1):

u_{\alpha_1,\alpha_2} = (\cos\alpha_2 \cos\alpha_1,\ \cos\alpha_2 \sin\alpha_1,\ \sin\alpha_2)^t    (1)

Now, between the objects A (the reference object) and B, we can define the degree to which B is in direction u_{α1,α2} with respect to A. The membership function μ_α(A) denotes the fuzzy set defined on the image such that the points of areas which satisfy to a high degree the relation "to be in the direction u_{α1,α2} with respect to the reference object A" have a high membership value. We denote by P a point of the domain of space that is visible from a reference-object point in the direction u_{α1,α2}, and by Q any point in A; then β(P,Q) (see Eq.(2)) expresses the angle between the vector QP and the direction u_{α1,α2}, computed in [0, π]:

\beta(P,Q) = \arccos \frac{\overrightarrow{QP} \cdot u_{\alpha_1,\alpha_2}}{\|\overrightarrow{QP}\|}    (2)

We then determine, for each point P, the point Q of A leading to the smallest angle β, denoted β_min. In the crisp case, this point Q is the reference-object point from which P is visible in the direction closest to u_{α1,α2}: β_min(P) = min_{Q∈A} β(P,Q). The fuzzy landscape μ_α(A) at point P is then defined as μ_α(A)(P) = f(β_min(P)), where f is a decreasing function from [0, π] into [0, 1].

The evaluation of the relative position of B with respect to A is then given by a function of μ_α(A)(x) and μ_B(x) for all x in object B. An appropriate tool for defining this function is the fuzzy pattern-matching

approach [14]. Following this approach, the evaluation of the matching between two possibility distributions consists of two numbers, a necessity degree N (a pessimistic evaluation) and a possibility degree Π (an optimistic evaluation), as often used in the fuzzy set community (see Eq.(3)):

\Pi^{A}_{\alpha_1,\alpha_2}(B) = \sup_{x \in B} t[\mu_\alpha(A)(x),\ \mu_B(x)]    (3)

The possibility corresponds to a degree of intersection between the fuzzy sets B and μ_α(A), while the necessity corresponds to a degree of inclusion of B in μ_α(A). They can also be interpreted in terms of fuzzy mathematical morphology, since the possibility Π^A_{α1,α2}(B) is equal to the dilation of B by μ_α(A) at the origin. Several other functions combining μ_α(A) and μ_B can be constructed. An average measure is also useful from a practical point of view and is defined as Eq.(4):

M^{A}_{\alpha_1,\alpha_2}(B) = \frac{1}{|B|} \sum_{x \in B} \mu_B(x)\, \mu_\alpha(A)(x)    (4)

where |B| denotes the fuzzy cardinality of B, |B| = \sum_{x \in B} \mu_B(x).

III. FUZZY ATTRIBUTED RELATIONAL GRAPHS AND GRAPH MATCHING

A. Fuzzy attributed relational graphs (FARGs)

A graph G = (V_G, E_G) is an ordered pair of a set of nodes V_G and a set of edges E_G. An edge in G connecting nodes u and v is denoted by (u, v), where (u, v) ∈ E_G. A Fuzzy Attributed Relational Graph (FARG) is used to model the vagueness associated with the attributes of nodes and edges. In our application, each node in the FARG represents an object in the image, and each edge between two nodes represents the relationship between the corresponding objects. All nodes have attributes from the set A = {a_i | i = 1, ..., n_A}. We denote the set of linguistic values (labels) associated with attribute a_i by Λ_i = {C_ik | k = 1, ..., n_ai}. The value of an attribute a_i at node j is a fuzzy set A_ji defined over Λ_i. For example, the node attribute a1 = position_label may be a fuzzy set defined over the linguistic category set Λ1 = {up, down, left, right}, and the position_label of node j may have membership values for these four labels of, e.g., A_j1 = {0.5, 0.2, 1, 0}. Similarly, the node attribute a2 = size_label may be a fuzzy set defined over the set of linguistic values Λ2 = {small, medium, large}. We denote the node label of node j by Eq.(5):

\ell(j) = \{ (a_i, A_{ji}) \mid A_{ji} \in \Phi(\Lambda_i);\ i = 1, \dots, n_A \}    (5)

where Φ(Λ_i) denotes the fuzzy power set of Λ_i. Each node attribute a_i is allowed to occur only once in ℓ(j). Edge attributes are treated similarly: each edge in the FARG has attributes from the set R = {r_i | i = 1, ..., n_R}, the set of linguistic values associated with edge attribute r_i is denoted Γ_i = {L_ik | k = 1, ..., n_ri}, and the value of an edge attribute r_i for an edge e = (j, k) is a fuzzy set R_ei defined over Γ_i.

B. Graph matching

R. Krishnapuram and R. Medasani presented a fuzzy graph matching algorithm called FGM [18] that uses ideas from relaxation labeling and fuzzy set theory to solve the sub-graph isomorphism problem. To extend FGM to FARGs, we need to define the compatibility u_ij ∈ [0, 1], a quantitative measure of the (absolute) degree of match between node i ∈ V_A and node j ∈ V_B, given the current fuzzy assignment matrix U. We start with the definition of the compatibility u_ij as Eq.(6):

u_{ij} = (w_{ij})^{0.5}\ \frac{1}{n^B_j} \sum_{k=1}^{n+1} \sum_{l=1}^{m+1} m_{kl}\, \bar{m}_{kl}, \qquad i = 1, \dots, n+1,\ j = 1, \dots, m+1    (6)

where w_ij is the degree of match between (the attributes of) node i ∈ V_A and node j ∈ V_B, m_kl ∈ [0, 1] is the matching score between the edge (i, k) ∈ E_A and the edge (j, l) ∈ E_B, M = [m_kl] is the matrix of these scores, M̄ = [m̄_kl] is the crisp assignment matrix closest to M that satisfies the assignment constraints for i = 1, ..., n+1 and j = 1, ..., m+1, and n^B_j is a normalization factor equal to the number of edges (with nonzero weights or attribute values) that are incident on node j ∈ V_B. Note that M̄ acts as a filter, so that each edge in graph B incident on node j contributes to u_ij only once; in other words, only the terms selected by M̄ survive in the double summation of Eq.(6). Also, w_ij is raised to the power 0.5 for enhancement purposes.
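To make the directional relation and the matching measures of Section II concrete, here is a minimal NumPy sketch of the fuzzy landscape of Eqs.(1)-(2) and of the possibility and average measures of Eqs.(3)-(4). The particular decreasing function f(b) = max(0, 1 - 2b/π), the choice of the min t-norm, and the toy coordinates are assumptions made only for illustration; they are not the paper's implementation.

import numpy as np

def direction(alpha1: float, alpha2: float = 0.0) -> np.ndarray:
    # Unit direction u_{alpha1,alpha2} of Eq.(1)
    return np.array([np.cos(alpha2) * np.cos(alpha1),
                     np.cos(alpha2) * np.sin(alpha1),
                     np.sin(alpha2)])

def fuzzy_landscape(A: np.ndarray, pts: np.ndarray, u: np.ndarray) -> np.ndarray:
    # mu_alpha(A)(P) = f(beta_min(P)) with f(b) = max(0, 1 - 2b/pi), beta from Eq.(2)
    out = np.empty(len(pts))
    for n, p in enumerate(pts):
        v = p - A                                    # vectors QP for every Q in A
        norm = np.linalg.norm(v, axis=1)
        beta = np.arccos(np.clip((v @ u) / np.where(norm == 0, 1.0, norm), -1, 1))
        beta[norm == 0] = 0.0                        # P belongs to A itself
        out[n] = max(0.0, 1.0 - 2.0 * beta.min() / np.pi)
    return out

def possibility(landscape_B: np.ndarray, mu_B: np.ndarray) -> float:
    # Eq.(3) with the min t-norm: degree of intersection of B and mu_alpha(A)
    return float(np.max(np.minimum(landscape_B, mu_B)))

def average_measure(landscape_B: np.ndarray, mu_B: np.ndarray) -> float:
    # Eq.(4): membership-weighted mean of the landscape over B
    return float(np.sum(mu_B * landscape_B) / np.sum(mu_B))

# Toy example: is B "to the right of" A (alpha1 = 0, i.e. the +x direction)?
A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])    # points of reference object A
B = np.array([[3.0, 0.5, 0.0], [4.0, 1.0, 0.0]])    # points of object B
mu_B = np.array([1.0, 1.0])                         # crisp membership of B
land = fuzzy_landscape(A, B, direction(0.0))
print(possibility(land, mu_B), average_measure(land, mu_B))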

IV. EXPERIMENTAL RESULTS

A. The Data Set

Experiments were performed with radiographic images from the IRMA (Image Retrieval in Medical Applications) dataset [19]. This is a growing collection of radiographic images acquired at the RWTH Aachen University of Technology Hospital, Germany, and it is used as a reference for medical image retrieval tasks. It currently contains 15363 arbitrarily selected anonymous radiographic images for which ground truth information is provided. The radiographs span 193 categories and depict various anatomic specimens of patients of various ages, genders, and pathologies [20]. We selected 4341 medical images from 10 familiar radiograph categories (cranium, brain, spine, arm, chest, abdomen, leg, pelvis, liver and hands) for our experiments. Table 1 gives the statistics of the 10 categories and the corresponding explanations.

TABLE I. STATISTICS OF THE 10 FAMILIAR RADIOGRAPH CATEGORIES
Category  | Explanation                                                                  | No. in db
CRANIUM   | round part of the skull that contains the brain                              | 654
BRAIN     | organ inside the head                                                        | 923
SPINE     | row of small bones connected together down the middle of the back            | 526
ARM       | two long parts of the body attached to the shoulders                         | 112
CHEST     | top part of the front of the body, between the neck and the stomach          | 627
ABDOMEN   | part of the body below the chest that contains the stomach and bowels        | 307
LEG       | one of the long parts that connect the feet to the rest of the body          | 198
PELVIS    | wide curved set of bones at the bottom of the body to which the legs and spine are connected | 204
LIVER     | large organ in the body that produces bile and cleans the blood              | 619
HANDS     | parts of the body at the end of the arms                                     | 171

B. Experiment

In the first experiment, we compared the performance of our approach with that of traditional methods. To be consistent with previously published methods, we used the implementations provided by the authors of each method that we tested, including their suggested distance thresholds. Finally, the comparison is made using the precision and recall of each method on all the medical image categories.

Figure 2 shows the mean average retrieval precision of the different methods over all radiographic categories, along with those of previous works.

Figure 2. Mean average retrieval precision [%] for each category using different methods. The different shades denote different methods and the blocks of bars denote different categories.

Our method presents a new fuzzy set framework combining the Markov random field

(MRF) model with a morphological idea; it uses Fuzzy Attributed Relational Graphs (FARGs) to model the vagueness associated with the attributes of image objects and their relationships. It overcomes the problem that an all-or-nothing definition leads to unsatisfactory results in many situations, and it achieves better image retrieval precision than the traditional methods.

Here we applied the correlation analysis to the different tasks individually and to all tasks jointly. On the one hand, HANDS, LEG and ARM are among the simplest structure classes and show high retrieval accuracy for all methods. On the other hand, LIVER, CHEST, ABDOMEN and CRANIUM are rather diverse classes that contain complicated geometric relationships between different objects, and our method shows distinctly higher retrieval accuracy than the other two models. Thus, the impact of the fuzzy set is much stronger there, whereas other, more prominent examples might not even be included in the testing data.

Figure 3. The top 10 retrieval results for the liver category using our method

For example, Figure 3 and Figure 4 show the top-10 image retrieval results, according to the FARGs obtained by our method, that are closest to the query sample (liver and chest, respectively). It can be seen that the prototypes capture the diversity of the data set very well.

C. Experiment

In the second experiment, we compared the performance of our approach with the classic SVM and TSVM (Transductive SVM) methods. We performed several relevance feedback experiments to evaluate the effectiveness of these approaches over a part of the IRMA dataset containing 3218 medical images from 29 categories. We designed an automatic feedback scheme to simulate the retrieval process conducted by real users. In each iteration, the system marks the first three incorrect images among the top 100 matches as irrelevant examples, and also selects at most 3 correct images as relevant examples (relevant examples used in previous iterations are excluded from the selection). The evaluation measures used in CBIR have been greatly influenced by those used in text-based information retrieval [21]. A straightforward and popular measure is the PR-graph, which depicts the relationship between precision and recall of a specific retrieval system; this measure is used in this paper. Concretely, for every recall value ranging from 0.0 to 1.0, the corresponding precision value is computed and then depicted in the PR-graph.
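The simulated feedback scheme and the PR-graph just described can be sketched as follows; the toy ranking and ground truth are invented for the example and this is not the authors' code.

import numpy as np

def automatic_feedback(ranking, is_relevant, used, k=100, n_fb=3):
    # One feedback round: the first 3 wrong images in the top-k become irrelevant
    # examples; up to 3 not-yet-used correct ones become relevant examples.
    top = ranking[:k]
    irrelevant = [i for i in top if not is_relevant[i]][:n_fb]
    relevant = [i for i in top if is_relevant[i] and i not in used][:n_fb]
    used.update(relevant)
    return relevant, irrelevant

def pr_curve(ranking, is_relevant, n_relevant):
    # Precision/recall pairs down the ranked list, as plotted in a PR-graph.
    hits = np.cumsum([is_relevant[i] for i in ranking])
    precision = hits / np.arange(1, len(ranking) + 1)
    recall = hits / n_relevant
    return recall, precision

# Toy ranking over 8 images, 4 of which are truly relevant.
truth = {0: True, 1: False, 2: True, 3: True, 4: False, 5: False, 6: True, 7: False}
rank = [0, 1, 2, 4, 3, 5, 7, 6]
print(automatic_feedback(rank, truth, used=set(), k=8))
print(pr_curve(rank, truth, n_relevant=4))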

Figure 4. The top 10 retrieval results for the chest category using our method

Figure 5. Average PR-graphs of SVM, TSVM, and our method at the 0th, 4th, and 8th relevance feedback rounds

The general PR-graphs at the 0th, 4th and 8th rounds of relevance feedback are shown in Figure 5 (a) to (c), respectively. Note that the performance at the 0th round corresponds to the performance before relevance feedback starts, that is, the retrieval performance with only the initial query.

A deficiency of the PR-graph is that it can hardly reflect directly the changes of retrieval performance caused by relevance feedback. Therefore, another graphical measure is also employed in this paper. A CBIR system usually exhibits a trade-off between precision and recall: obtaining high precision usually means sacrificing recall, and vice versa. Considering that both precision and recall are important in CBIR, the BEP (Break-Even Point) is introduced as an evaluation measure. By definition, if precision and recall are tuned to an equal value, this value is called the BEP of the system [13]; the higher the BEP value, the better the performance. By connecting the BEPs after different rounds of relevance feedback, a BEP-graph is obtained, where the horizontal axis enumerates the rounds of relevance feedback and the vertical axis gives the BEP value.

The general BEP-graphs are presented in Figure 6 (a) to (c), which also shows that the performance of our method is always the best.

V. CONCLUSIONS

Uncertainty pervades every aspect of CBIR: image content cannot be described and represented easily, user queries are ill-posed, the similarity measure to be used is not precisely defined, and the relevance feedback given by the user is approximate. To address these issues, fuzzy sets can be used to model the vagueness that is usually present in the image content, the user query, and the similarity measure. This allows us to retrieve relevant images that might be missed by traditional approaches.

Figure 6. Average BEP-graphs of SVM, TSVM, and our method using 200, 500 and 1000 CT images

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation (No. 61003104), the Ph.D Foundation of Shandong Province (BS2011DX025), the Postdoctoral Granted Financial Support (No. 20110491579) and the Technology Star Program of China (No. 20100301). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

REFERENCES

[1] Alexandra Teynor and Hans Burkhardt, "Patch Based Localization of Visual Object Class Instances," MVA2007 IAPR Conference on Machine Vision Applications, Tokyo, Japan, May 2007, pp. 211-214.
[2] Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R., "Content-based image retrieval at the end of the early years," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349-1380, 2000.
[3] LIU Hui, ZHANG Cai-ming, JI Xiu-hua and ZHANG Yun-feng, "An Algorithm for Co-training in Medical Image Retrieval," International Journal of Innovative Computing, Information and Control, Vol. 5(12):4327-4333, December 2009.
[4] L. Hollink, A.Th. Schreiber, B. Wielinga, and M. Worring, "Classification of user image descriptions," Journal of Human Computer Studies, November 2004.
[5] S. Li, Markov Random Field Modeling in Computer Vision. New York: Springer-Verlag, 1995.
[6] Y. Jhung, P. H. Swain, "Bayesian contextual classification based on modified M-estimates and Markov Random Fields," IEEE Trans on Pattern Anal. & Machine Intell., 34(1):67-75, 1996.
[7] A. H. Solberg, T. Taxt, "A Markov random field model for classification of multisource satellite imagery," IEEE Trans on Geosci. Remote Sensing, 34(1):100-113, 1996.
[8] C. E. Warrender, M. F. Augusteijn, "Fusion of image classifications using Bayesian techniques with Markov random fields," Int. J. Remote Sens., 20(10):1987-2002, 1999.
[9] I. Bloch, "Fuzzy Relative Position Between Objects in Image Processing: A Morphological Approach," IEEE Trans on Pattern Anal. & Machine Intell., 21(7):657-664, 1999.
[10] R. Krishnapuram, J.M. Keller, Y. Ma, "Quantitative Analysis of Properties and Spatial Relations of Fuzzy Image Regions," IEEE Trans on Fuzzy Systems, 15(3):222-233, 1993.
[11] J.M. Keller, L. Sztandera, "Spatial Relations among Fuzzy Subsets of an Image," Int'l Symp. Uncertainty Modeling and Analysis, pp. 207-211, 1990.
[12] K. Miyajima, A. Ralescu, "Analysis of Spatial Relations between 2D Segmented Regions," European Congress on Fuzzy and Intelligent Technologies, pp. 48-54, 1993.
[13] J.M. Keller and X. Wang, "Learning Spatial Relationships in Computer Vision," Int'l Conf. Fuzzy Systems, pp. 118-124, 1996.
[14] P. Matsakis, L. Wendling, J. Desachy, "A New Way to Represent Relative Position between Areal Objects," IEEE Trans on Pattern Anal. & Machine Intell., 21(7):634-643, 1999.
[15] K.P. Chan and Y.S. Cheung, "Fuzzy-Attribute Graph with Application to Chinese Character Recognition," IEEE Trans on Systems, Man, and Cybernetics, 22(1):153-160, 1992.
[16] LIU Hui, ZHANG Yun-feng, "Fuzzy set based image retrieval by relationship of objects," Innovative Computing, Information and Control - Express Letters, 3(3):733-738, September 2009.
[17] D. Dubois and H. Prade, "Weighted Fuzzy Pattern Matching," Fuzzy Sets and Systems, pp. 313-331, 1988.
[18] R. Krishnapuram, R. Medasani, "A Fuzzy Approach to Graph Matching," Proc. IFSA Congress Conf., pp. 1029-1033, Aug. 1999.
[19] T.M. Lehmann, M.O. Güld, C. Thies, B. Plodowski, D. Keysers, B. Ott, H. Schubert, "IRMA - Content-based image retrieval in medical applications," in Proc. 14th World Congress on Medical Informatics (Medinfo 2004), IOS Press, Amsterdam, vol. 2, pp. 842-848, 2004.
[20] http://ganymed.imib.rwth-aachen.de/irma/datasets_en.php
[21] H. Muller, W. Muller and D. M. Squire, "Performance evaluation in content-based image retrieval: Overview and proposals," Pattern Recognition Letters, vol. 22, no. 5, pp. 593-601, 2001.

Hui LIU works in the Dept. of Computer Science & Technology, Shandong University of Finance and Economics, Jinan, China, and the Digital Media Technology Key Lab of Shandong Province, Jinan, China. Her research interests include computer applications based on CT images.

Guochao SUN works in the Central Hospital of Binzhou, Binzhou, China. His research interests include medical image analysis and computer-aided diagnosis.

A Detailed Study of NHPP Software Reliability Models
(Invited Paper)

Richard Lai*, Mohit Garg
Department of Computer Science and Computer Engineering, La Trobe University, Victoria, Australia
*Corresponding author, E-mail: lai@cs.latrobe.edu.au

doi:10.4304/jsw.7.6.1296-1306

Abstract—Software reliability deals with the probability that software will not cause the failure of a system for a specified time under a specified condition. The probability is a function of the inputs to and use of the system as well as a function of the existing faults in the software. The inputs to the system determine whether existing faults, if any, are encountered. Software Reliability Models (SRMs) provide a yardstick to predict future failure behavior from known or assumed characteristics of the software, such as past failure data. Different types of SRMs are used for different phases of the software development life-cycle. With the increasing demand to deliver quality software, software development organizations need to manage quality achievement and assessment. While testing a piece of software, it is often assumed that the correction of errors does not introduce any new errors and the reliability of the software increases as bugs are uncovered and then fixed. The models used during the testing phase are called Software Reliability Growth Models (SRGM). Unfortunately, in industrial practice, it is difficult to decide the time for software release. An important step towards remediation of this problem lies in the ability to manage the testing resources efficiently and affordably. This paper presents a detailed study of existing SRMs based on the Non-Homogeneous Poisson Process (NHPP), which claim to improve software quality through effective detection of software faults.

Index Terms—Software Reliability Growth Models, Non-Homogeneous Poisson Process, Flexible Models

I. INTRODUCTION

Today, science and technology require high performance hardware and high quality software in order to make improvements and achieve breakthroughs. It is the integrating potential of the software that has allowed designers to contemplate more ambitious systems, encompassing a broader and more multidisciplinary scope, with the growth in utilization of software components being largely responsible for the high overall complexity of many system designs. However, in stark contrast with the rapid advancement of hardware technology, proper development of software technology has failed miserably to keep pace in all measures, including quality, productivity, cost and performance. When the requirement for and dependencies on computers increase, the possibility of a crisis from computer failures also increases. The impact of failures ranges from inconvenience (e.g., malfunctions of home appliances), economic damage (e.g., interruption of banking systems), to loss of life (e.g., failures of flight systems or medical software). Hence, for optimizing software use, it becomes necessary to address issues such as the reliability of the software product. Using tools/techniques/methods, software developers can design/propose several testing programs or automate testing tools to meet the client's technical requirements, schedule and budget. These techniques can make it easier to test and correct software, detect more bugs, save more time and reduce expenses significantly [10]. The benefits of fault-free software to software developers/testers include increased software quality, reduced testing costs, improved release time to market and improved testing productivity.

There has been much effort expended in quantifying the reliability associated with a software system through the development of models which govern software failures based on various underlying assumptions [44]. These models are collectively called Software Reliability Models (SRMs). The main goal of these models is to fit a theoretical distribution to time-between-failure data, to estimate the time-to-failure based on software test data, to estimate software system reliability and to design a stopping rule to determine the appropriate time to stop testing and to release the software into the market place [4, 49]. However, the success of SRMs depends largely on selecting the model that best satisfies the stakeholder's need.

Recent research in the field of modeling software reliability addresses the key issue of making the software release decision, i.e., deciding whether or not a software product can be transferred from its development phase to operational use [8, 17, 50]. It is often a trade-off between an early release to capture the benefits of an earlier market introduction, and the deferral of product release to enhance functionality or improve quality. Despite various attempts by researchers, this question still stands and there is no stopping rule which can be applied to all types of data sets. Furthermore, hardly any work has been done on the unification of SRMs that can provide a solution for stakeholders to model and predict future failure behavior of a software system in a better way. Software reliability
engineering produces a model of a software system based testing, primarily because that is when the problems
on its failure data to provide a measurement for software appeared.
reliability. Several SRMs have been developed over the As technology has matured, root causes of incorrect
past three decades. As a general class of well developed and unreliable software have been identified earlier in the
stochastic process model in reliability engineering, Non life-cycle. This has been due in part to the availability of
Homogeneous Poisson Process (NHPP) models have results from measurement research and/or the application
been successfully used in studying hardware reliability of reliability models. The use of a model also requires
problems. They are especially useful to describe failure careful definition of what a failure is. Reliability models
processes which possess certain trends such as reliability can be run separately on each failure type and severity
growth and deterioration. Therefore, an application of level. Reliability models are mathematically intense,
NHPP models to software reliability analysis is easily incorporating stochastic processes, probability and
implemented. statistics in their calculations, and relying on maximum
The mathematical and statistical functions used in likelihood estimates, numerical methods (which may or
software reliability modeling employ several may not converge) and confidence intervals to model
computational steps. The equations for the models their assumptions.
themselves have parameters that are estimated using Despite their shortcomings, such as excessive data
techniques like least squares fit or maximum likelihood requirements for even modest reliability claims, difficulty
estimation. Then the models, usually equations in some of taking relevant non-measurable factors into account
exponential form, must be executed. Verifying that the etc. software reliability models offer a way to quantify
selected model is valid for the particular data set may uncertainty that helps in assessing the reliability of a
require iteration and study of the model functions. From software-based system, and may well provide further
these results, predictions about the number of remaining evidence in making reliability claims. According to the
faults or the time of next failure can be made, and classification scheme proposed by Xie [44] considering
confidence intervals for the predictions can be computed. the probabilistic assumption of SRM, and Kapur and
A model is classified as an NHPP model if the main Garg [17] considering the dynamic aspect of the models,
assumption is that the failure process is described by the SRMs can be categorized into three categories viz.
NHPP. Apart from their wide applicability in the testing Markov, NHPP and Bayesian models. We briefly discuss
domain, the main characteristic of this type of models is the key features of Markov models and then study the
that there exists a mean value function which is defined NHPP and Bayesian models in detail.
as the expected number of failures up to a given time. In
A. Markov models
fact SRM is the mean value function of an NHPP. These
models are flexible in nature as they can model both The Markov process represents the probabilistic failure
continuous and discrete stochastic processes. This paper process in Markov models. The software is represented
presents a detailed study of existing SRMs based on Non- by countable states, each state corresponding to a failure
Homogeneous Poisson Process (NHPP), which claim to (fault). The main characteristic of such model is that the
improve software quality through effective detection of software, at a given point of time, has count ably many
software faults. The definitions, assumptions and states and such states may be the number of remaining
descriptions of models based on NHPP will be provided, faults. Given that the process is at a specific state, its
with the aim of showing how a large number of existing future development does not depend on its past history.
models can be classified into different categories. The transition between the states depends on the present
state of the software and the transition probability. The
II. THE SPECTRUM OF SOFTWARE RELIABILITY MODELS failure intensity of the software is assumed to be a
discontinuous function which depends on the current state
The work on software reliability models started in the of the software.
early 70's; the first model being presented in 1972. Using this information, the Jelinski and Moranda (J-M)
Various models proposed in the literature tend to give model [14] is modeled as a Markov process model. Next,
quite different predictions for the same set of failure data. Schick and Wolvertan [35] modified the J-M model by
It should be noted that this kind of behavior is not unique considering a time dependent failure intensity function
to software reliability modeling but is typical of models and the time between failures to follow Weibull
that are used to project values in time and not merely distribution. In addition, Shanthikumar [41] proposed a
represent current values. Furthermore, a particular model Markov model with time dependent transition
may give reasonable predictions on one set of failure data probabilities. Then, Goel [6] modified the J-M model by
and unreasonable predictions on another. Consequently, introducing the concept of imperfect debugging. Later,
potential users may be confused and adrift with little Littlewood [25] proposed a model based on the semi-
guidance as to which models may be best for their markov process to describe modular structured software.
applications. Models have been developed to measure, Jelinski - Moranda De-eutrophication Model -
estimate and predict the reliability of computer software. The J-M model is one of the earliest models for
Software reliability has received much attention because assessing software reliability by drawing inferences
reliability has always had obvious effects on highly from failure data under some simple assumptions on
visible aspects of software development, testing prior to the nature of the failure process. These assumptions
delivery and maintenance. Early efforts focused on are:
TABLE I. MODEL ASSUMPTIONS VS REALITY

Assumption: Faults are repaired immediately when discovered.
Reality: Faults are not repaired immediately. A work-around may be to leave out duplicates and to accumulate test time if a non-repaired fault prevents other faults from being found. Fault repair may introduce new faults. It might be the case that newly introduced faults are less likely to be discovered as retesting is not as thorough as the original testing.

Assumption: No new code is introduced in testing.
Reality: It is frequently the case that fixed or new code is added during the test period. This may change the shape of the fault detection curve.

Assumption: Faults are only reported by the testing group.
Reality: Faults may be reported by lots of groups due to parallel testing. If the test time of other groups is added, there is a problem of equivalency between an hour of the testing group and an hour of other groups (types of testing may differ). Restricting faults to those discovered by the testing group eliminates important data.

Assumption: Each unit of time is equivalent.
Reality: The appropriate measure of time must relate to the test effort. Examples are: calendar time, execution time and number of test cases. However, potential problems are: the test effort is asynchronous (calendar time), some tests create more stress on a per-hour basis (execution time) and tests do not have the same probability of finding a fault.

Assumption: Tests represent the operational profile.
Reality: It is hard to define the operational profile of a product, reflecting how it will be used in practice. It would consist of a specification of classes of input and the probability of their occurrence. In test environments, tests are continually being added to cover faults discovered in the past.

Assumption: Tests represent adoption characteristics.
Reality: The rates of adoption, describing the number and type of customers who adopt the product and the time when they adopt, are often unknown.

Assumption: Faults are independent.
Reality: When sections of code have not been as thoroughly tested as other code, tests may find a disproportionate share of faults.

Assumption: Software is tested in isolation.
Reality: The software under testing might be embedded in a system. Interfaces with, for example, hardware can hamper the measurement process (test delay due to mechanical or hardware problems, re-testing with adjusted mechanical or hardware parts).

Assumption: Software is a black box.
Reality: There is no accounting for partitioning, redundancy and fault-tolerant architectures. These characteristics are often found in safety-critical systems.

Assumption: The organization does not change.
Reality: When multiple releases of a product are developed, the organization might significantly change, for example the development process and the development staff. After the first release, a different department might even execute the development of the next release. It may also heavily influence the test approach by concentrating on the changes made for corrective maintenance and preventive maintenance (new functionality).

1. At the beginning of testing, there are n0 faults in the software code, with n0 being an unknown but fixed number.
2. Each fault is equally dangerous with respect to the probability of its instantaneously causing a failure. Furthermore, the hazard rate of each fault does not change over time, but remains constant at φ.
3. The failures are not correlated, i.e. given n0 and φ, the times between failures (Δt1, Δt2, ..., Δtn0) are independent.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously and without introducing any new fault into the software.

The hazard rate of the software between the (i-1)th and the ith failure is then

z(Δt | t_{i-1}) = φ[n0 - M(t_{i-1})] = φ[n0 - (i - 1)]   (1)

The failure intensity function is the product of the inherent number of faults and the probability density of the time until activation of a single fault, i.e.:

λ(t) = n0 φ exp(-φt)   (2)

Therefore, the mean value function is

m(t) = n0[1 - exp(-φt)]   (3)

It can easily be seen from equations (2) and (3) that the failure intensity can also be expressed as

λ(t) = φ[n0 - m(t)]   (4)

According to equation (4), the failure intensity of the software at time t is proportional to the expected number of faults remaining in the software; again, the hazard rate of an individual fault is the constant of proportionality. Moreover, many software reliability growth models can be expressed in a form corresponding to equation (4). Their difference often lies in what is assumed about the hazard rate of an individual fault.

B. NHPP models
As a general class of well developed stochastic process model in reliability engineering, NHPP models have been successfully used in studying hardware reliability problems. These models are also termed as fault counting models and can be either finite failure or infinite failure models, depending on how they are specified. In these models, the number of failures experienced so far follows the NHPP distribution. The NHPP model class is a close relative of the homogeneous Poisson model, the difference being that here the expected number of failures is allowed to vary with time. Hence, they are useful for both calendar time data as well as for execution time data.
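As a small numerical illustration of the J-M relationships above (equations (2)-(4)), the following Python sketch evaluates the mean value function and checks that the two expressions for the failure intensity agree. The parameter values for n0 and φ are made-up assumptions, not data from the paper.

```python
import numpy as np

def jm_mean_value(t, n0, phi):
    """Expected number of failures by time t, equation (3): m(t) = n0 (1 - exp(-phi t))."""
    return n0 * (1.0 - np.exp(-phi * t))

def jm_intensity(t, n0, phi):
    """Failure intensity, equation (2): lambda(t) = n0 phi exp(-phi t)."""
    return n0 * phi * np.exp(-phi * t)

# Illustrative parameter values (assumptions, not taken from the paper).
n0, phi = 120.0, 0.05
t = np.linspace(0.0, 100.0, 11)

# Equation (4): lambda(t) = phi [n0 - m(t)]; numerically it matches equation (2).
lam_eq2 = jm_intensity(t, n0, phi)
lam_eq4 = phi * (n0 - jm_mean_value(t, n0, phi))
assert np.allclose(lam_eq2, lam_eq4)

print(jm_mean_value(t, n0, phi))
```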
C. Basic assumptions of NHPP models
Some of the basic assumptions (apart from some special ones for the specific models discussed) assumed for NHPP models are as follows:
1. A software system is subject to failure during execution caused by faults remaining in the system.
2. The number of faults detected at any time is proportional to the remaining number of faults in the software.
3. Failure rate of the software is equally affected by faults remaining in the software.
4. On a failure, repair effort starts and the fault causing the failure is removed with certainty.
5. All faults are mutually independent from a failure detection point of view.
6. The proportionality of failure occurrence/fault isolation/fault removal is constant.
7. Corresponding to the fault detection/removal phenomenon at the manufacturer/user end, there exists an equivalent fault detection/fault removal at the user/manufacturer end.
8. The fault detection/removal phenomenon is modeled by NHPP.
However, in practice, some of these assumptions may not hold their ground. Table 1 shows how assumptions and notions fail in reality [12, 26, 42, 43].

D. Comments on using NHPP models
Among the existing models, NHPP models have been widely applied by practitioners. The application of NHPP to reliability analysis can be found in elementary literature on reliability. The calculation of the expected number of failures/faults up to a certain point in time is very simple due to the existence of the mean value function. The estimates of the parameters are easily obtained by using either the method of maximum likelihood estimation (MLE) or least squares estimation (LSE).
Other important advantages of NHPP models which should be highlighted are that NHPPs are closed under superposition and time transformation. We can easily incorporate two or more existing NHPP models by summing up the corresponding mean value functions. The failure intensity of the superposed process is also just the sum of the failure intensities of the underlying processes.

II. RELEGATION OF NHPP MODELS

For binomial-type models there is a fixed number of faults at the start, while for Poisson-type models the eventual number of faults could be discovered over an infinite amount of time. In Poisson process models, there exists a relationship between:
- The failure intensity function and the reliability function
- The failure intensity function and the hazard rate
- The mean value function and the cumulative distribution function (CDF) of the time to failure of an individual fault
If the mean value function m(t) is a linear function of time, then the process is the Homogeneous Poisson Process (HPP); however, if it is a non-linear function of time, then the process is NHPP.
The earlier SRGMs, known as Exponential SRGMs, were developed to fit an exponential reliability growth curve. Similar to the J-M model [14], several other models that are either identical to the Exponential model except for notational differences or are very close approximations were developed by Musa [28], Schneidewind [36], and Goel and Okumoto [7]. Also, some Exponential models were developed to cater for different situations during testing [17, 45]. As a result, we have a large number of SRGMs, each being based on a particular set of assumptions that suit a specific testing environment.

III. MODEL GROUPS
Generally, the SRGMs are classified into two groups. The first group contains models which use machine execution time (i.e., CPU time) or calendar time as a unit of fault detection/removal period. Such models are called Continuous time models. The second group contains models which use the number of test cases as a unit of fault detection period. Such models are called Discrete time models, since the unit of software fault detection period is countable. A large number of models have been developed in the first group while there are fewer in the second group. In this section, we explore a broad class of NHPP models based on Continuous and Discrete distributions. Table 2 categorizes commonly used NHPP models which show growth in reliability.

TABLE II. CONTINUOUS AND DISCRETE TIME MODELS

Model group: Continuous-time models - which use machine execution time (i.e. CPU time) or calendar time as a unit of fault detection/removal period
Examples: - Exponential model developed by Goel and Okumoto (G-O) [7]
          - Delayed S-shaped model due to Yamada et al. [46]

Model group: Discrete-time models - which use the number of test cases as a unit of fault detection period
Examples: - Exponential model developed by Yamada [47]
          - Delayed S-shaped model developed by Kapur et al. [17]
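To make the superposition property noted in subsection D concrete, the short sketch below combines two NHPP components by summing their mean value functions; the failure intensity of the combined process is then simply the sum of the component intensities. The two exponential components and their parameter values are made-up assumptions used only for illustration.

```python
import math

# Two hypothetical NHPP components, each given by a mean value function m(t)
# and its derivative, the failure intensity lambda(t).
m1 = lambda t: 60.0 * (1.0 - math.exp(-0.04 * t))
lam1 = lambda t: 60.0 * 0.04 * math.exp(-0.04 * t)

m2 = lambda t: 40.0 * (1.0 - math.exp(-0.10 * t))
lam2 = lambda t: 40.0 * 0.10 * math.exp(-0.10 * t)

# Superposed NHPP: mean value functions add, and so do the failure intensities.
m_total = lambda t: m1(t) + m2(t)
lam_total = lambda t: lam1(t) + lam2(t)

t = 30.0
print(m_total(t), lam_total(t))
```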
TABLE III. NOTATIONS OF EXPONENTIAL & DELAYED S-SHAPED CONTINUOUS-TIME MODELS

a: Initial fault-content of the software
b: Fault removal rate per remaining fault per test case
a, b: Constants, representing initial fault content and rate of fault removal per remaining fault for the software
mf(t): Expected number of failures occurring in the time interval (0, t]

A. Continuous-time models
A very large number of Continuous time models have been developed in the literature to monitor the fault removal process, which measure and predict the reliability of software systems. During the testing phase, it has been observed that the relationship between the testing time and the corresponding number of faults removed is either exponential or S-shaped or a mix of the two [1].
Let [N(t), t ≥ 0] denote a discrete counting process representing the cumulative number of failures experienced (faults removed) up to time t; N(t) is said to be an NHPP with intensity function λ(t) if it satisfies the following conditions:
1. There are no failures experienced at time t = 0, i.e. N(t = 0) = 0 with probability 1.
2. The process has independent increments, i.e., the number of failures experienced in (t, t + Δt], i.e., N(t + Δt) - N(t), is independent of the history. Note this assumption implies the Markov property that the N(t + Δt) of the process depends only on the present state N(t) and is independent of its past state N(x), for x < t.
3. The probability that a failure will occur during (t, t + Δt] is λ(t)Δt + o(Δt), i.e., Pr[N(t + Δt) - N(t) = 1] = λ(t)Δt + o(Δt). Note that the function o(Δt) is defined as:

lim_{Δt→0} o(Δt)/Δt = 0   (5)

In practice, it implies that the second or higher order effects of Δt are negligible.
4. The probability that more than one failure will occur during (t, t + Δt] is o(Δt), i.e. Pr[N(t + Δt) - N(t) > 1] = o(Δt).

Based on the above NHPP assumptions, it can be shown that the probability that N(t) is a given integer k is expressed by:

Pr[N(t) = k] = [m(t)]^k / k! exp(-m(t)),  k ≥ 0   (6)

The function m(t) is a very useful descriptive measure of failure behavior. The function λ(t), which is called the instantaneous failure intensity, is defined as:

λ(t) = lim_{Δt→0} Pr[N(t + Δt) - N(t) ≥ 1] / Δt   (7)

Given λ(t), the mean value function m(t) = E[N(t)] satisfies

m(t) = ∫_0^t λ(s) ds   (8)

Inversely, knowing m(t), the failure intensity function λ(t) can be obtained as:

λ(t) = dm(t)/dt   (9)

Generally, by using a different non-decreasing function m(t), we obtain different NHPP models. Define the number of remaining software failures at time t by N̄(t); we have:

N̄(t) = N(∞) - N(t)   (10)

where N(∞) is the number of faults which can be detected by an infinite time of testing. It follows from the standard theory of NHPP that the distribution of N̄(t) is Poisson with parameter [m(∞) - m(t)], that is:

Pr[N̄(t) = k] = [m(∞) - m(t)]^k / k! exp{-[m(∞) - m(t)]},  k ≥ 0   (11)

The reliability function at time t0 is exponential and is given by:

R(t | t0) = exp{-[m(t0 + t) - m(t0)]}   (12)

The above conditional reliability function is called a software reliability function based upon an NHPP for a Continuous SRGM.

Continuous-time exponential models - The G-O model [7] captures many software reliability issues without being overly complicated. It is similar to the J-M model except that the failure rate decreases continuously in time. This is a parsimonious model whose parameters have a physical interpretation, and can be used to predict various quantitative measures for software performance assessment.
According to basic assumption 8, m(∞) follows a Poisson distribution with expected value N. Therefore, N is the expected number of initial software faults, as compared to the fixed but unknown actual number of initial software faults n0 in the J-M model.
Basic assumption 2 states that the failure intensity at time t is given by:

λ(t) = dm(t)/dt = φ[N - m(t)]   (13)
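The relationships in equations (6), (9) and (12) hold for any NHPP once a mean value function m(t) is chosen. The short Python sketch below (an illustration only; the exponential form of m(t) and its parameter values are arbitrary assumptions) evaluates the failure-count probability and the conditional reliability for a given m(t).

```python
import math

def count_probability(m, t, k):
    """Equation (6): Pr[N(t) = k] = m(t)^k / k! * exp(-m(t))."""
    mt = m(t)
    return mt**k / math.factorial(k) * math.exp(-mt)

def conditional_reliability(m, t0, t):
    """Equation (12): R(t | t0) = exp(-[m(t0 + t) - m(t0)])."""
    return math.exp(-(m(t0 + t) - m(t0)))

# Any non-decreasing mean value function defines an NHPP model.  Here we use
# the exponential form m(t) = a(1 - exp(-bt)) with made-up parameters.
a, b = 100.0, 0.02
m_exp = lambda t: a * (1.0 - math.exp(-b * t))

print(count_probability(m_exp, t=50.0, k=40))        # probability of exactly 40 failures by t = 50
print(conditional_reliability(m_exp, t0=50.0, t=10.0))  # probability of no failure in (50, 60]
```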
As in the J-M model, the failure intensity is the product of the constant hazard rate of an individual fault and the number of expected faults remaining in the software.
The following differential equation results from basic assumption 3:

dm(t)/dt = b[a - m(t)]   (14)

Solving the first order linear differential equation (14) with the initial condition m(t = 0) = 0 gives the following mean value function for the NHPP:

m(t) = a(1 - exp(-bt))   (15)

The mean value function given in equation (15) is exponential in nature and does not provide a good fit to the S-shaped growth curves that generally occur in software reliability. But the model is popular due to its simplicity.

Continuous-time delayed S-shaped model - The model proposed by Yamada et al. [46] is a descendant of the G-O model [7], the data requirements being similar and the assumptions being similar with one exception. Yamada et al. reasoned that due to learning and skill improvements of the programmers during the debugging phase of the development cycle, the error detection curve is often not exponential but rather S-shaped. Accordingly, we have two differential equations:

dmf(t)/dt = b[a - mf(t)]   (16)

dm(t)/dt = b[mf(t) - m(t)]   (17)

Solving equations (16) and (17) with initial conditions mf(t = 0) = 0 and m(t = 0) = 0, we obtain the mean value function as:

m(t) = a(1 - (1 + bt) exp(-bt))   (18)

Alternatively, the model can also be formulated as a one-stage process directly as follows:

dm(t)/dt = b(t)[a - m(t)]   (19)

where b(t) = b^2 t / (1 + bt).

It is observed that b(t) → b as t → ∞. This model was specifically developed to account for the lag between the failure observation and its subsequent removal. This kind of derivation is peculiar to software reliability only.

B. Discrete-time models
Yamada and Osaki [47] proposed two classes of Discrete Time Models. One class describes an error detection process in which the expected number of errors detected per test case is geometrically decreasing, while the other class is proportional to the current error content. Kapur [17] proposed a discrete time model based on the concept that the testing phase has two different processes, namely fault isolation and fault removal. Kapur et al. [19] further proposed a discrete time model based on the assumption that the software consists of n different types of faults and each type of fault requires a different strategy to remove the cause of the failure due to that fault. Kapur et al. [18] also proposed a discrete time model with a discrete Rayleigh testing effort curve.
In addition to basic assumptions 1, 3 and 8, Kapur et al. [17-19] assume the following for Discrete time models:
1. Each time a failure occurs, an immediate (delayed) effort takes place to decide the cause of the failure in order to remove it.
2. The debugging process is perfect - To obtain a realistic estimate of the residual number of faults, and the reliability, it is necessary to amend the assumption of instantaneous and perfect debugging. A number of researchers have recognized this shortcoming, and have attempted to incorporate explicit debugging into some of the software reliability models. Dalal [4] assumes that software debugging follows a constant debugging rate, and incorporates debugging into an exponential order statistics software reliability model. Schneidewind [40], [39], [38] incorporates a constant debugging rate into the Schneidewind software reliability model [37]. Gokhale et al. [8] incorporate explicit repair into SRGM using a numerical solution. Imperfect debugging also affects the residual number of faults, and in fact at times can be a major cause of field failures and customer dissatisfaction. Imperfect debugging has also been considered by other researchers [7], [9-10].
During the software testing phase, software systems are executed with a sample of test cases to detect/remove software faults which cause software failures. A discrete counting process [N(n), n ≥ 0] is said to be an NHPP with mean value function m(n) if it satisfies the following conditions:
1. There are no failures experienced at n = 0, i.e. N(n = 0) = 0.
2. The counting process has independent increments; that is, for any collection of numbers of test cases n1, n2, ..., nk, where 0 < n1 < n2 < ... < nk, the k random variables N(n1), N(n2) - N(n1), ..., N(nk) - N(nk-1) are statistically independent.
For any numbers of test cases ni and nj, where 0 ≤ ni ≤ nj, we have:

Pr[N(nj) - N(ni) = x] = [m(nj) - m(ni)]^x / x! exp{-[m(nj) - m(ni)]},  x ≥ 0   (20)

The mean value function m(n), which is bounded above and is non-decreasing in n, represents the expected accumulative number of faults detected by n test cases. Then the NHPP model with m(n) is formulated by:

Pr[N(n) = x] = [m(n)]^x / x! exp(-m(n)),  x ≥ 0   (21)
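Before turning to the detailed discrete-time formulations, here is a hedged sketch of how the two continuous-time mean value functions above, equations (15) and (18), might be fitted to cumulative fault-count data with non-linear least squares (scipy.optimize.curve_fit). The failure data and starting values are invented for illustration; they are not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def m_exponential(t, a, b):
    """G-O exponential mean value function, equation (15)."""
    return a * (1.0 - np.exp(-b * t))

def m_delayed_s(t, a, b):
    """Delayed S-shaped mean value function, equation (18)."""
    return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

# Hypothetical test data: testing time (weeks) and cumulative faults removed.
t_obs = np.array([1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)
y_obs = np.array([8, 15, 27, 36, 43, 48, 52, 55, 57, 58, 59], dtype=float)

for name, model in [("exponential", m_exponential), ("delayed S-shaped", m_delayed_s)]:
    params, _ = curve_fit(model, t_obs, y_obs, p0=[70.0, 0.1], maxfev=10000)
    a_hat, b_hat = params
    sse = np.sum((y_obs - model(t_obs, a_hat, b_hat)) ** 2)
    print(f"{name}: a = {a_hat:.1f}, b = {b_hat:.3f}, SSE = {sse:.1f}")
```

In practice one would compare such fits using goodness-of-fit and predictive-validity criteria; the point here is only the mechanics of fitting a mean value function to data.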
TABLE IV. NOTATIONS OF EXPONENTIAL & DELAYED S-SHAPED DISCRETE-TIME MODELS

a: Initial fault-content of the software
b: Fault removal rate per remaining fault per test case
m(n): The expected mean number of faults removed by the nth test case
mf(n): The expected mean number of faults caused by the nth test case

As a useful software reliability growth index, the fault detection rate per fault (per test case) after the nth test case is given by:

d(n) = [m(n + 1) - m(n)] / [m(∞) - m(n)],  n ≥ 0   (22)

where m(∞) represents the expected number of faults to be eventually detected.
Let N̄(n) denote the number of faults remaining in the system after the nth test case, given as:

N̄(n) = N(∞) - N(n)   (23)

The expected value of N̄(n) is given by:

μ(n) = m(∞) - m(n)   (24)

which is equivalent to the variance of N̄(n).
Suppose that nd faults have been detected by n test cases. The conditional distribution of N̄(n), given that N(n) = nd, is given by:

Pr[N̄(n) = y | N(n) = nd] = [μ(n)]^y / y! exp[-μ(n)]   (25)

which means a Poisson distribution with mean μ(n), independent of nd.
Now, the probability of no faults detected between the nth and (n + h)th test cases, given that nd faults have been detected by n test cases, is given by:

R(h | n) = exp{-[m(n + h) - m(n)]},  h ≥ 0   (26)

The above conditional reliability function, called the software reliability function, is based on an NHPP for a Discrete SRGM and is independent of nd.

Discrete-time exponential model - Based on the previously mentioned assumptions for Discrete models, Yamada and Osaki [47] showed that the expected cumulative number of faults removed between the nth and the (n + 1)th test cases is proportional to the number of faults remaining after the execution of the nth test run, and satisfies the following difference equation:

m(n + 1) - m(n) = b[a - m(n)]   (27)

Solving the above difference equation using the probability generating function (PGF) with initial condition m(n = 0) = 0, one can obtain the closed form solution as:

m(n) = a[1 - (1 - b)^n]   (28)

The above mean value function is exponential in nature and does not provide a good fit to the S-shaped growth curves that generally occur in software reliability. Next, we briefly discuss below an S-shaped model.

Discrete-time delayed S-shaped model - In the model developed by Kapur et al. [17], the testing phase is assumed to have two different processes, namely fault isolation and fault removal processes. Accordingly, we have two difference equations:

mf(n + 1) - mf(n) = b[a - mf(n)]   (29)

m(n + 1) - m(n) = b[mf(n) - m(n)]   (30)

Solving the above difference equations (29) and (30) using PGF with initial conditions mf(n = 0) = 0 and m(n = 0) = 0 respectively, one can obtain the closed form solution as:

m(n) = a[1 - (1 + bn)(1 - b)^n]   (31)

Alternatively, the model can also be formulated as a one-stage process directly as follows:

m(n + 1) - m(n) = [b^2 (n + 1) / (1 + bn)] [a - m(n)]   (32)

It is observed that b^2 (n + 1)/(1 + bn) → b as n → ∞. This model was specifically developed to account for the lag in the failure observation and its subsequent removal.

IV. EXTENSIONS OF NHPP MODELS
Some NHPP models depict exponential reliability growth whereas others show S-shaped growth, depending on the nature of the growth phenomenon during testing. They are broadly classified into two categories. If the growth is uniform, generally Exponential models have been used, and for non-uniform growth, S-shaped models have been developed. As S-shapedness in reliability can be ascribed to different reasons, many models exist in the literature, at times leading to confusion in model selection from the models available.
Initially, Goel and Okumoto [7] proposed the time dependent failure rate model based on NHPP. Later, Goel modified his original model [6] by introducing the test quality parameter. This is a continuous approximation to the original Exponential model and is described in terms of an NHPP process with a failure intensity function that is exponentially decaying.
TABLE V. EXTENSIONS OF NHPP MODELS

Model type: Flexible models - Models which can capture variability in either exponential and S-shaped curves
Examples: - Model for module structured software
          - Two types of fault models
          - To model the failure phenomenon of software during operation

Model type: Enhanced NHPP models - A model which incorporates explicitly the time-varying test-coverage function in its analytical formulation, and provides for defective fault detection and test coverage during the testing and operational phases
Example:  - Log-logistic model

Model type: Time-dependent transition probability models - Models which can be used for both calendar time data as well as for the execution time data
Examples: - Basic execution model
          - Logarithmic model
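Looking back at the discrete-time models of Section III, the sketch below (illustrative only; the parameter values are arbitrary assumptions) evaluates the exponential and delayed S-shaped discrete mean value functions of equations (28) and (31) and the per-fault detection rate of equation (22).

```python
import numpy as np

def m_discrete_exponential(n, a, b):
    """Equation (28): m(n) = a [1 - (1 - b)^n]."""
    return a * (1.0 - (1.0 - b) ** n)

def m_discrete_delayed_s(n, a, b):
    """Equation (31): m(n) = a [1 - (1 + bn)(1 - b)^n]."""
    return a * (1.0 - (1.0 + b * n) * (1.0 - b) ** n)

def detection_rate(m, n, a, b):
    """Equation (22): d(n) = [m(n+1) - m(n)] / [m(inf) - m(n)], with m(inf) = a."""
    return (m(n + 1, a, b) - m(n, a, b)) / (a - m(n, a, b))

# Arbitrary illustrative parameters: a = expected total faults, b = removal rate per test case.
a, b = 80.0, 0.05
n = np.arange(0, 60, 10)

print(m_discrete_exponential(n, a, b))
print(m_discrete_delayed_s(n, a, b))
print(detection_rate(m_discrete_exponential, n, a, b))  # constant and equal to b
print(detection_rate(m_discrete_delayed_s, n, a, b))    # grows towards b, giving S-shaped growth
```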

For all practical purposes, the G-O and the other models are indistinguishable from the Exponential model. The Exponential model can be further generalized [13] to simplify the modeling process by having a single set of equations to represent a number of important models having the Exponential hazard rate function. The overall idea is that the failure occurrence rate is proportional to the number of faults remaining and the failure rate remains constant between failures, while it is reduced by the same amount when a fault is removed.
In other cases, where there was a need to fit the reliability growth by an S-shaped curve, some available hardware reliability models depicting a similar curve were used [31]. In the literature, S-shapedness has been attributed to different reasons. Ohba and Yamada [33], and Yamada et al. [46] ascribed to it the mutual dependency between software faults, whereas later SRGMs were developed taking various causes of the S-shapedness into account, such as the models developed by Ohba [32], Yamada et al. [46], Kapur et al. [17], Kareer et al. [20], Bittanti et al. [3], Kapur and Garg [16], and Pham [34].

A. Flexible modeling approach
In addition to the models discussed above, other NHPP models termed as flexible growth models have been developed in the past which can capture variability in exponential and S-shaped curves. In this section, we present a brief overview of some of these models.
Ohba proposed a Hyper-Exponential model [31] to describe the fault detection process in module structured software; Khoshgoftaar [22] proposed the K-stage Erlangian model; Xie and Zhao [45] proposed a simple model with graphical interpretation; Kapur and Garg [15] modified the G-O model by introducing the concept of imperfect debugging; Zeephongsekul [49] proposed a model describing the case where a primary fault introduces a secondary fault in real life software development projects. Non-uniform testing is more popular and hence the S-shaped growth curve has been observed in many software development projects.
Kareer et al. [20] and Yamada [48] proposed two types of fault models where each fault type is modeled by an S-shaped curve; Kimura et al. [23] proposed an exponential S-shaped model which describes software with two types of faults. Later in the testing phase, Kapur [17] ascribed it to the presence of different types of faults in software systems.
The above SRGMs have been proposed for the testing phase and it is generally assumed that the operational profile is similar to the testing phase, which may not be the case in practice. Very few attempts have been made to model the failure phenomenon of commercial software during its operational use. One of the reasons for this can be attributed to the inability of software engineering to measure the growth during the usage of software while it is in the market. This is unlike the testing phase, where testing effort follows a definite pattern. Kenney [21] proposed a model to estimate the number of faults remaining in the software during its operational use. He has assumed a power function to represent the usage rate of the software, though he argues that the rate at which commercial software is used is dependent upon the number of its users. Kenney's model however fails to capture the growth in the number of users of the software.
Also, it is important that the SRGM should explicitly take into account faults of different severity. Such a modeling approach was earlier adopted by Kapur et al. [19]. This approach can capture variability in the growth curves depending on the environment in which it is being used and at the same time, it has the capability to reduce to either exponential or S-shaped growth curves.

B. Enhanced NHPP models
The Enhanced NHPP model developed by Gokhale and Trivedi [11] states that the rate at which faults are detected is proportional to the product of the rate at which potential fault sites are covered and the expected number of remaining faults. This model allows for a time-dependent failure occurrence rate per fault, i.e., the rate at which an individual fault will surface can vary with testing time.
The NHPP models have constant, increasing or decreasing failure occurrence rates per fault. These models were inadequate to capture the failure processes underlying some of the failure data sets, which exhibit an increasing/decreasing nature of the failure occurrence rate per fault.
The Log-logistic model was proposed to capture the increasing/decreasing nature of the failure occurrence rate per fault, as captured by the hazard of the Log-logistic distribution [24]. The mean value function m(t) in this case is given by:

m(t) = a (λt)^κ / (1 + (λt)^κ)   (33)

where λ and κ are constants.

C. Log-normal models
In his proposed model, Mullen [27] showed that the distribution of failure rates for faults in software systems tends to be lognormal. Since the distribution of event rates tends to be lognormal and faults are just a random subset or sample of the events, the distribution of the failure rates of the faults also tends to be lognormal.
The probability density function (pdf) of the lognormal distribution is given by:

f(x) = (1 / (xσ√(2π))) exp{-(ln x - μ)^2 / (2σ^2)},  x > 0   (34)

where x is the variate, μ is the mean value of the log of the variate, and σ^2 is the variance of the log of the variate. The mean value of the variate is exp(μ + σ^2/2). The median value of the variate is exp(μ). The mode of the variate is exp(μ - σ^2) and its variance is exp(2μ + σ^2)[exp(σ^2) - 1]. If x is distributed as Λ(μ, σ^2) then 1/x is distributed as Λ(-μ, σ^2).
The cumulative distribution function (cdf) of the lognormal, in the form of the tabulated integral of the standard normal density function, is given as:

∫_0^x f(u | μ, σ) du = ∫_{-∞}^{(ln x - μ)/σ} (1/√(2π)) exp(-z^2/2) dz   (35)

The ability of the lognormal to fit empirical failure rate distributions is shown to be superior to that of the gamma distribution (the basis of the Gamma/EOS family of reliability growth models) [2] or a Power-law model.

D. Time-dependent transition probability models
Some NHPP models are capable of coping with the case of non-homogeneous testing and hence they are useful for both calendar time data as well as for execution time data [44]. These models are termed Time-dependent transition probability models. In these models, the failure intensity decreases exponentially with the expected number of failures experienced. Musa [28] and Musa and Okumoto [30] proposed the basic execution time model, based on the concept of failure observation and the corresponding fault removal phenomenon, and the logarithmic Poisson model respectively.

Basic execution models - This model is perhaps the most popular of the software reliability models [5]. The time between failures is expressed in terms of computational processing units (CPU) rather than the amount of calendar time that has elapsed. The model contains a feature for converting from calendar time to processing time or vice versa.
The mean value function is such that the expected number of failures is proportional to the expected number of undetected faults at that time, i.e., the cumulative number of failures follows a Poisson process.

m(t) = b0 [1 - exp(-b1 t)]   (36)

where b0, b1 > 0.
Musa himself [29] recommends the use of this model (as contrasted to Musa's logarithmic Poisson model) when the following conditions are met:
a) Early reliability is predicted before program execution is initiated and failure data observed
b) The program is substantially changing over time as the failure data are observed
This model can also be used if one is interested in seeing the impact of a new software engineering technology on the development process.

Logarithmic Poisson models - This model [30] is similar to the G-O model except that it attempts to consider that later fixes have a smaller effect on a program's reliability than earlier ones. The model is also called the Musa-Okumoto logarithmic Poisson model because the expected number of failures over time is a logarithmic function. Thus, the model is an infinite failure model.
The basic assumption of the model, beyond the assumption that the cumulative number of failures follows a Poisson process, is that the failure intensity decreases exponentially with the expected number of failures experienced:

m(t) = b0 ln(b1 t + 1)   (37)

V. CONCLUSIONS
Reliability models are a powerful tool for predicting, controlling and assessing software reliability. As a general class of well developed stochastic process modeling in reliability engineering, NHPP models have been successfully used in studying hardware and software reliability problems. They are especially useful to describe failure processes which possess certain trends, such as reliability growth and deterioration, thus making the application of NHPP models to software reliability analysis easily implemented.
In this paper, we first studied the initial model (J-M) based on the Markov process to provide a measurement for software reliability. These models were later grouped into NHPP and Bayesian models. We described the modeling process for both Continuous and Discrete time models based on NHPP. These models can also be classified according to their asymptotic representation as either concave or S-shaped. We explored a few commonly used extensions of NHPP models. Then, we studied the flexible modeling approach in which the models can be customized as per the need. Finally, we discussed Enhanced NHPP models and models based on time-
dependent transition probability to model both execution a simulation approach. IEEE Transactions on Reliability,
and calendar time. The existing NHHP SRGMs can help 55(2):281292, 2006.
remove faults by accelerating the testing effort intensity, [11] S. S. Gokhale and K. S. Trivedi. A Time/Structure Based
and the proper allocation and management of the testing Software Reliability Model. Annals of Software
resources. Engineering, 8:85121, 1999.
A software release decision is often a trade-off [12] D. Hamlet. Are We Testing for True Reliability? IEEE
between early release to capture the benefits of an earlier Software, 9(4):2127, 1992.
market introduction, and the deferral of product release to [13] American Institute of Aeronautics and Astronautics.
enhance functionality, or improve quality. In practice, Recommended Practice for Software Reliability. In
software manufacturers are challenged to find answers to ANSI/AIAA Report 0131992. AIAA, 1992.
questions such as how much testing is needed?; how to [14] Z. Jelinski and P. B. Moranda. Software Reliability
manage the testing resources effectively and efficiently?; Research. In Statistical Computer Performance Evaluation
(Ed.) W. Freiberger, pages 465484, 1972.
when should a product be released?; what is the market
window?; what are the expectations of customers and [15] P. K. Kapur and R. B. Garg. Optimal Software Release
end-users? etc. The decision making process to release a Policies for Software Growth Model Under Imperfect
Debugging. Researche Operationelle/Operations Research
product will normally involve different stakeholders who (RAIRO), 24:295305, 1990.
will not necessarily have the same preferences for the
[16] P. K. Kapur and R. B. Garg. A Software Reliability
decision outcome. A decision is only considered Growth Model for Error Removal Phenomenon. Software
successful if there is congruence between the expected Engineering Journal, 7:291294, 1992.
reliability level and the actual outcome, which sets
[17] P. K. Kapur, R. B. Garg, and S. Kumar. Contributions to
requirements for decision implementation. NHHP Hardware and Software Reliability. World Scientific,
SRGMs can help software practitioners decide if the Singapore, 1999.
reliability of a software product has reached a given [18] P. K. Kapur, M. Xie, R. B. Garg, and A. K. Jha. A Discrete
threshold and to decide when the software system is Software Reliability Growth Model with Testing Effort. 1st
ready for release. International Conference on Software Testing, Reliability
and Quality Assurance, 1994.
REFERENCES [19] P. K. Kapur, S. Younes, and S. Agarwala. A General
Discrete Software Reliability Growth Model. International
[1] Ch. A. Asad, M. I. Ullah, and M. J. Rehman. An Approach
Journal of Modelling and Simulation, 18(1):6065, 1998.
for Software Reliability Model Selection. International
Computer Software and Applications Conference, [20] N. Kareer, P. K. Kapur, and P.S. Grover. An S-shaped
(COMPSAC), pages 534539, 2004. Software Reliability Growth Model With TwoTypes of
Errors. Microelectronics Reliability, 30:10851090, 1990.
[2] P. G. Bishop and R. E. Bloomfield. Using a log-normal
failure rate distribution for worst case bound reliability [21] G. Q. Kenney. Estimating Defects in Commercial Software
prediction. During Operational Use. IEEE Transactions on Reliability,
42(1):107115, 1993.
[3] S. Bittanti, P. Blozern, E. Pedrotti, M. Pozzi, and A.
Scattolini. Forecasting Software Reliability. In G. Goss and [22] T. M. Khoshgoftaar. Non-Homogeneous Poisson Process
J. Hartmanis, editors, A Flexible Modeling Approach in for Software Reliability. COMPSTAT, pages 1314, 1988.
Software Reliability Growth, pages 101140. Springer- [23] M. Kimura, S. Yamada, and S. Osaki. Software Reliability
Verlag, 1988. Assessment for an Exponential S-shaped Reliability
[4] S. R. Dalal and C. L. Mallows. Some graphical aids for Growth Phenomenon. Computers and Mathematics with
deciding when to stop testing software. IEEE Trans. on Applications, 24:7178, 1992.
Software Engineering, 8(2):169175, 1990. [24] L. M. Leemis. Reliability-Probabilistic Models and
[5] W. Farr. Software Reliability Modeling Survey. In M. R. Statistical Methods. Prentice-Hall, 1995.
Lyu, editor, Handbook of Software Reliability [25] B. Littlewood. Forecasting Software Reliability. In G.
Engineering, pages 71118. McGraw-Hill, Inc., 1996. Goss and J. Hartmanis, editors, Software Reliability
[6] A. L. Goel. Software Reliability Models: Assumptions, Modeling and Identification, chapter 5, pages 141209.
Limitations and Applicability. IEEE Transactions on Springer-Verlag, 1987.
Software Engineering, pages 14111423, 1985. [26] H. Hecht M. Hecht, D. Tang and R. W. Brill. Quantitative
[7] A. L. Goel and K. Okumoto. Time-Dependent Error Reliability and Availability Assessment for Critical
Detection Rate Model for Software Reliability and other Systems Including Software. 12th Annual Conference on
Performance Measures. IEEE Transactions on Reliability, Computer Assurance, pages 147158, 1997.
R-28(3):206211, 1979. [27] R. Mullen. The Lognormal Distribution of Software
[8] S. Gokhale, M. R. Lyu, and K. S. Trivedi. Analysis of Failure Rates: Origin and Evidence. The Ninth
software fault removal policies using a non homogeneous International Symposium on Software Reliability
continuous time markov chain. Software Quality Journal, Engineering (ISSRE), pages 124133, 1998.
pages 211230, 2004. [28] J. D. Musa. A Theory of Software Reliability and its
[9] S. Gokhale, P. N. Marinos, K. S. Trivedi, and M. R. Lyu. Applications. IEEE Transactions on Software Engineering,
Effect of repair policies on software reliability. Proc. of 1(3):312327, 1975.
Computer Assurance (COMPASS), pages 105116, 1997. [29] J. D. Musa, A. Iannino, and K. Okumoto. Software
[10] S. S. Gokhale, M. R. Lyu, and K. S. Trivedi. Incorporating Reliability: Measurement, Prediction, Application.
fault debugging activities into software reliability models: McGraw-Hill, Inc., USA, 1987.
[30] J. D. Musa and K. Okumoto. A Logarithmic Poisson High Assurance Systems Engineering (HASE), pages 139
Execution Time Model for Software Reliability 148, 2004.
Measurement. International Conference on Software [41] J. G. Shanthikumar. Software Reliability Models: A
Engineering, (ICSE), pages 230238, 1984. Review. Microelectronics Reliability, 23:903949, 1983.
[31] M. Ohba. Software Reliability Analysis Models. [42] J. A. Whittaker. What Is Software Testing? And Why Is It
Nontropical Issue, pages vol. 28, Number 4, pp 428, 1984. So Hard? IEEE Software, pages 7079, 2000.
[32] M. Ohba. Inflection S-shaped Software Reliability Growth [43] A. Wood. Software Reliability Growth Models:
Model. In S. Osaki and Y. Hotoyama, editors, Lecture Assumptions Vs. Reality. International Symposium on
Notes in Economics and Mathematical System, pages 101 Software Reliability Engineering (ISSRE), pages 136143,
140. Springer-Verlag, 1988. 1997.
[33] M. Ohba and S. Yamada. S-shaped Software Reliability [44] M. Xie. Software Reliability Modeling. World Scientific,
Growth Model. 4th International Conference on Reliability Singapore, 1991.
and Maintainability, pages 430436, 1984.
[45] M. Xie and M. Zhao. On Some Reliability Growth Models
[34] H. Pham. Handbook of Reliability Engineering. Springer- With Simple Graphical Interpretations. Microelectronics
Verlag London limited, USA, 2003. Reliability, 33(2):149167, 1993.
[35] G. J. Schick and R. W. Wolverton. An Analysis of [46] S. Yamada, M. Ohba, and S. Osaki. S-shaped Reliability
Competing Software Reliability Models. IEEE Trans- Growth Modeling for Software Error Detection. IEEE
actions on Software Engineering, 4(2):104120, 1978. Transactions on Reliability, R-32:475478, 1983.
[36] N. F. Schneidewind. Analysis of Error Processes in [47] S. Yamada and S. Osaki. Discrete Software Reliability
Computer Software. Sigplan Notices, 10:337346, 1975. Growth Models. Applied Stochastic Models and Data
[37] N. F. Schneidewind. Software reliability model with Analysis, 1:6577, 1985.
optimal selection of failure data. IEEE Trans. On Software [48] S. Yamada, S. Osaki, and H. Narihisa. Software Reliability
Engineering, 19(11):10951014, 1993. Growth Models With Two Types of Errors. Researche
[38] N. F. Schneidewind. Modeling the fault correction process. Operationelle/Operations Research (RAIRO), 19:87104,
Proc. of Intl. Symposium on Software Reliability 1985.
Engineering (ISSRE), pages 185191, 2001. [49] P. Zeephongsekul, C. Xia, and S. Kumar. A Software
[39] N. F. Schneidewind. An integrated failure detection and Reliability Growth Model Primary Errors Generating
fault correction model. Proc. of Intl. Conference on Secondary Errors under Imperfect Debugging. IEEE
Software Maintenance, pages 238241, 2002. Transactions on Reliability, R-43(3):408413, 1994.
[40] N. F. Schneidewind. Assessing reliability risk using fault [50] D. R. Jeske and X. Zhang. Some Successful Approaches to
correction profiles. Proc. of Eighth Intl. Symposium on Software Reliability Modeling in Industry. The Journal of
Systems and Software, 74:8599, 2005.
Confidence Estimation for


Graph-based Semi-supervised Learning
Tao Guo
Visual Computing and Visual Reality Key Laboratory of Sichuan Province, Chengdu, China
Email: tguo35@gmail.com

Guiyang Li
College of Computer Science, Sichuan Normal University, Chengdu, China
Email: guiyang.li@gmail.com

AbstractTo select unlabeled example effectively and each sample set can be divided into two distinct subsets.
reduce classification error, confidence estimation for graph- Each of the subsets is sufficient for learning if there is
based semi-supervised learning CEGSL is proposed. sufficient labeled example. Then the two subsets are
This algorithm combines graph-based semi-supervised conditionally independent given the class attribute. Two
learning with collaboration-training. It makes use of classifiers iteratively trained on one subset and they teach
structure information of sample to calculate the
each other with a respective subset of unlabeled example
classification probability of unlabeled example explicitly.
With multi-classifiers, the algorithm computes the and their highest confidence predictions. Since co-
confidence of unlabeled example implicitly. With dual- training requires two sufficient and redundant views, such
confidence estimation, the unlabeled example is selected to a requirement can hardly be met in most scenarios [7].
update classifiers. The comparative experiments on UCI Goldman and Zhou proposed an improved co-training
datasets indicate that CEGSL can effectively exploit algorithm [8]. It employs time-consuming cross
unlabeled data to enhance the learning performance. validation technique to determine how to label the
Index Termsgraph, collaboration-training, confidence, unlabeled examples and how to produce the final
classification, semi-supervised leaning, hypothesis [9]. In 2005, Zhou and Li proposed a new co-
training style algorithm named tri-training [10]. It is easy
to be applied to common data mining application.
I. INTRODUCTION
However, the performance of this algorithm goes
Applications such as web search, pattern recognition, degradation in some circumstances and exists three issues:
text classification, genetic research are examples where (1) estimation for classification error is unsuitable. (2)
cheap unlabeled data can be added to a pool of labeled excessively confined restriction introduce more
samples. In these applications, a large amount of labeled classification noise. (3) differentiation between initial
data should be available for building a model with good labeled example and labeled unlabeled example is
performance. During past decade, many supervised deficient [11]. Zhan [12] proposed an algorithm called
learning algorithms (e.g. J4.8, Bays and SVM) have been co-training semi-supervised active learning with noise
developed and extensively learned use labeled data. filter. In this algorithm, three fuzzy buried Markov
Unfortunately, it is often the case that there is a limited models are used to perform semi-supervised learning
number of labeled data along with a large pool of cooperatively. Some human-computer interactions are
unlabeled data in many practices [1]. It is noteworthy that actively introduced to label the unlabeled sample at
a number of methods called semi-supervised learning certain time. The experimental results show that the
have been developed for using unlabeled data to improve algorithm can effectively improve the utilization of
the accuracy of prediction [2]. It has received unlabeled samples, reduce the introduction of noise
considerable attention in the machine learning literature samples and raise the accuracy of expression recognition.
due to its potential in reducing the need for expensive But human interaction will reduce the efficiency of the
labeled data. Early methods in semi-supervised learning algorithm. In this paper, an explicit confidence estimation
were using mixture models and extensions of the EM for graph-based semi-supervised learning algorithm
algorithm [3]. More recent approaches belong to one of (CEGSL) is proposed. This algorithm makes use of
the following categories: self-training, transductive structure of sample data to calculate the classification
SVMs, co-training, split learning, and graph-based probability of unlabeled example explicitly. Combining
methods [4]. with co-training, this algorithm computes the confidence
Co-training is a prominent approach in semi- of unlabeled example implicitly with three classifiers and
supervised learning proposed by Blum and Mitchell [5]. to select unlabeled example efficiently.
It requires two sufficient and redundant views to learning The rest of the paper is structured as follows: Section 2
[6]. In this algorithm, it assumes that the description of describes graph-based semi-supervised learning. Section
3 introduces the proposed algorithms. Section 4 shows The flow diagram of the algorithm proposed in this
experimental and comparative results in different UCI paper is shown in Figure 1.
data sets. Section 5 makes concludes.
Input labeled and unlabeled data
II. GRAPH-BASED SEMI-SUPERVISED LEARNING
Graph-based semi-supervised learning algorithm Calculate
makes use of example sets and similarity to create a similarity matrix
diagram. The nodes in the graph correspond to example.
The weight of edge represents similarity that connects
two examples. Graph-based semi-supervised learning Classifier 1 Classifier 2 Classifier 3
problem is a regular optimization problem. Definition of
the problem includes the objective function needed to
optimize and regular items defined by decision function.
It solves the problem by optimizing the parameters of Update classifiers?
optimal model. Decision function for the model has two
properties: (1) the output label from unlabeled example Y
tries to match that from labeled example. (2) the whole Calculate N
graph satisfies smoothness. Graph-based semi-supervised confidence of
learning algorithm uses the popular assumption directly unlabeled data based
or indirectly. The assumption requires similar labels in a on similarity matrix
small local region and it also reflects local smoothness of
decision function. Under this assumption, a large number Select unlabeled data
of unlabeled examples make the space of example more with high confidence Output final
compact, thus it can indicate characteristic of local region to update classifiers classifiers
more accurately and makes the decision function fit the
data better. The training stage of CEGSL
The target function of graph-based semi-supervised A.. Description of CEGSL Algorithm
learning algorithm includes two parts, loss function and Given data set R {X1X2Xn } , it includes
regular items. Different algorithm selects different loss
function and regular item. Zhu X J[13] proposed a semi- labeled and unlabeled examples. Assuming nl in R are
supervised learning algorithm with harmonic function of labeled examples, its data set Yl { yl1 , yl 2, yl n } ;
Gaussian random occasions in 2003. This method is a l

continuous relaxation method for discrete Markov. The nu n nl are unlabeled examples and its data set
loss function in objective function is a quadratic function
Yu { yu1 , yu 2, yu nu } .The entire data set Y {Yl , Yu } .
with infinite weight. Regular item is a combinational
Laplacian based on graph. Although a variety of graph- CEGSL algorithm consists of following steps. First,
based semi-supervised learning algorithm set the reading examples to built a graph with labeled and
objective function differently, they can be concluded to unlabeled examples as vertex and the similarity between
formula (1) examples as edge. Then, re-sampling labeled example set
L with Bootstrap to built initialized training set for three
n n
F ( y) wi , j ( yi y j )2
classifiers. For each classifier, the other two classifiers
1 are auxiliary classifiers in each iteration. They classify
i 1 j 1 the examples which are in unlabeled example set U and
Where y represents prediction labels for unlabeled put the identified examples and their labels into a buffer.
The confidence is calculated explicitly using the graph.
examples, wi , j represents matrix of weight in graph. The The unlabeled examples with high confidence are put into
objective of graph-based semi-supervised learning is to training set. The main classifier is adjusted until the
optimize F ( y ) and obtain optimal parameter of model. classification errors of the three classifiers are not
III. CONFIDENCE ESTIMATION FOR GRAPH-BASED SEMI-SUPERVISED LEARNING

The CEGSL algorithm combines the advantages of graph-based semi-supervised learning and collaborative-training algorithms. It uses three classifiers to perform collaborative training and compares the confidence of unlabeled examples implicitly. In order to select more reliable unlabeled examples to add to the training set, it makes use of the structure information of the examples to calculate the classification probability of unlabeled examples explicitly. The training stage of CEGSL in this paper is shown in Figure 1.

[Figure 1. The training stage of CEGSL: input labeled and unlabeled data; calculate the similarity matrix; train Classifier 1, Classifier 2 and Classifier 3; calculate the confidence of the unlabeled data based on the similarity matrix; select unlabeled data with high confidence to update the classifiers; if the classifiers are updated, repeat, otherwise output the final classifiers.]

A. Description of the CEGSL Algorithm

Given a data set R = {X_1, X_2, ..., X_n}, it includes labeled and unlabeled examples. Assuming n_l examples in R are labeled, the corresponding label set is Y_l = {y_{l1}, y_{l2}, ..., y_{l n_l}}; n_u = n - n_l examples are unlabeled and their label set is Y_u = {y_{u1}, y_{u2}, ..., y_{u n_u}}. The entire label set is Y = {Y_l, Y_u}.

The CEGSL algorithm consists of the following steps. First, the examples are read to build a graph, with the labeled and unlabeled examples as vertices and the similarity between examples as edges. Then, the labeled example set L is re-sampled with Bootstrap to build the initial training sets of three classifiers. For each classifier, the other two classifiers act as auxiliary classifiers in each iteration: they classify the examples in the unlabeled example set U and put the identified examples and their labels into a buffer. The confidence is calculated explicitly using the graph, and the unlabeled examples with high confidence are put into the training set. The main classifier is adjusted until the classification errors of the three classifiers are no longer reduced; finally, the algorithm terminates. Figure 2 shows the procedure of the CEGSL algorithm.

The labeled examples used by CEGSL are defined as L = {(x_1, y_1), (x_2, y_2), ..., (x_{|L|}, y_{|L|})}, where (x_i, y_i) denotes that the label of example x_i is y_i (y_i in {-1, +1}). The large set of unlabeled examples is defined as U = {x_1, x_2, ..., x_{|U|}}, with |L| << |U|. By sampling the labeled examples and initializing the three classifiers, we get three classifiers h_i (1 <= i <= 3). Three buffers, one per classifier, are used to save the unlabeled examples given the same vote by the other two auxiliary classifiers.

There are two requirements for terminating the algorithm: the number of iterations is greater than a specified number K, or the classifier error rate e_i increases.

Input: labeled example set L, unlabeled example set U, iteration number K
Output: final classifiers h_i, 1 <= i <= 3
1. Calculate the similarity between any two examples in the labeled and unlabeled example sets.
2. Randomly sample three data sets from the labeled data set to initialize the classifiers h_i.
3. Calculate p_i, q_i, the label z_i and the confidence |p_i - q_i| for each unlabeled example.
4. For each classifier, the other two are used as auxiliary classifiers to vote; the unlabeled data with the same vote are put into the corresponding buffer.
5. Update the classifiers with the unlabeled examples that have a high confidence |p_i - q_i|.
6. Terminate the algorithm when the number of iterations is greater than the specified number K or the classifier error rate e_i increases; otherwise return to step 3.

Figure 2. The procedure of explicit confidence estimation for the graph-based semi-supervised learning algorithm
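To make the procedure in Figure 2 concrete, the following is a minimal training-loop sketch rather than the authors' implementation: the decision-tree base learner stands in for the BP neural network and ID3 classifiers used in the experiments, the helper estimate_confidence stands in for the graph-based computation of p_i and q_i described below, and the top-10% selection follows the setting reported later in this section.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def cegsl_train(X_l, y_l, X_u, estimate_confidence, K=20, top_frac=0.10):
    # Step 2: three classifiers, each initialized on a Bootstrap sample of L.
    clfs, Xs, ys = [], [], []
    for _ in range(3):
        Xb, yb = resample(X_l, y_l)
        clfs.append(DecisionTreeClassifier().fit(Xb, yb))
        Xs.append(Xb)
        ys.append(yb)
    prev_err = [np.inf] * 3
    for _ in range(K):
        # Step 3: graph-based confidences |p_i - q_i| and labels for the unlabeled pool.
        conf, labels = estimate_confidence(X_u)
        for i in range(3):
            a, b = [j for j in range(3) if j != i]
            # Step 4: the two auxiliary classifiers vote; keep the examples they agree on.
            agree = clfs[a].predict(X_u) == clfs[b].predict(X_u)
            if not agree.any():
                continue
            idx = np.where(agree)[0]
            # Step 5: take the most confident fraction (top 10% in the paper).
            keep = idx[np.argsort(-conf[idx])[:max(1, int(top_frac * len(idx)))]]
            Xi = np.vstack([Xs[i], X_u[keep]])
            yi = np.concatenate([ys[i], labels[keep]])
            cand = DecisionTreeClassifier().fit(Xi, yi)
            # Step 6: keep the update only while the error on the labeled set decreases.
            err = np.mean(cand.predict(X_l) != y_l)
            if err < prev_err[i]:
                clfs[i], Xs[i], ys[i], prev_err[i] = cand, Xi, yi, err
    return clfs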

In the algorithm, steps 3 and 4 maintain the quality of the selected unlabeled examples, and step 5 performs the selection of unlabeled examples. When choosing how many unlabeled examples to select, if more unlabeled examples are selected, the possibility of introducing noise increases; if the selected example set is small, the convergence rate is affected. After repeated experiments, the algorithm takes the top 10% of unlabeled examples to help the training of the classifiers, which gives better results. Also, the number of iterations K is set to 20 in this experiment. Since the classification error of unlabeled examples is difficult to calculate, this paper assumes that the labeled and unlabeled examples follow the same distribution. The classification error rate e_i is defined as the number of misclassified labeled examples divided by the total number of labeled examples. The similarity S_{i,j} is defined as S_{i,j} = exp(-||x_i - x_j||^2 / \sigma^2), in which \sigma is a constant and the RBF kernel is used to calculate the similarity.

B. Graph-based Explicit Confidence Estimation for Unlabeled Examples

Graph-based semi-supervised learning is an important branch of semi-supervised learning research; representative algorithms include the Label Propagation Algorithm [14] and the Graph Mincut Algorithm [15]. It uses a graph to present the relationship between data: nodes in the graph represent examples and edges between nodes represent the similarity between examples. The algorithm then searches for the labels of the unlabeled examples by minimizing the inconsistency of the labels over the graph. The inconsistency is defined as:

F(y) = \sum_{i=1}^{n} \sum_{j=1}^{n} S_{i,j} (y_i - y_j)^2 = Y^T L Y    (2)

Where S_{i,j} is the n x n similarity matrix and L represents the non-normalized graph Laplacian. For a graph constructed from labeled and unlabeled examples, the labels of the unlabeled examples are calculated by minimizing F(y). Since regular graph-based semi-supervised learning can only calculate the labels of unlabeled examples directly, this paper modifies the algorithm by referencing [15]. The target function F(S, y) includes two parts: one is used for calculating the inconsistency F_l(S, y) between labeled and unlabeled examples, and the other is used to compute the inconsistency F_u(S, y_u) among the unlabeled examples. Two criteria need to be satisfied when assigning labels to unlabeled examples: (1) two unlabeled examples with high similarity have the same label; (2) an unlabeled example takes the same label as a labeled example when they have high similarity. The inconsistency F_u(S, y_u) is defined as:

F_u(S, y_u) = \sum_{i,j=1}^{n_u} S_{i,j} (y_i^u - y_j^u)^2    (3)

The inconsistency F_l(S, y) between labeled and unlabeled examples is defined as:

F_l(S, y) = \sum_{i=1}^{n_l} \sum_{j=1}^{n_u} S_{i,j} (y_i^l - y_j^u)^2    (4)

Then the target function is defined as:

F(S, y) = F_l(S, y) + C F_u(S, y_u)    (5)

Where C is a constant used to weight the importance of F_u. A suitable labeling is found by minimizing F(S, y). Let h(x_i) represent the prediction label of x_i; then the target function is:

min F(S, y)   s.t. h(x_i) = y_{li}, i = 1, 2, ..., n_l    (6)

Substituting formulas (3) and (4) into (6), the target function is expressed as formula (7):

min F(S, y) = min [ \sum_{i=1}^{n_l} \sum_{j=1}^{n_u} S_{i,j} (y_i^l - y_j^u)^2 + C \sum_{i,j=1}^{n_u} S_{i,j} (y_i^u - y_j^u)^2 ]   s.t. h(x_i) = y_{li}, i = 1, 2, ..., n_l    (7)
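The two inconsistency terms and the combined objective of formulas (3)-(5) can be evaluated directly from the similarity matrix. The sketch below is an illustration under the paper's definitions, not the authors' code; the toy similarity matrix, the candidate labeling and the value of C are assumptions.

import numpy as np

def inconsistencies(S, y_l, y_u, C=1.0):
    # S is the full (n_l + n_u) x (n_l + n_u) similarity matrix,
    # ordered so that the first n_l rows/columns are the labeled examples.
    n_l = len(y_l)
    S_lu = S[:n_l, n_l:]                 # labeled-to-unlabeled similarities
    S_uu = S[n_l:, n_l:]                 # unlabeled-to-unlabeled similarities
    F_l = float((S_lu * (y_l[:, None] - y_u[None, :]) ** 2).sum())   # formula (4)
    F_u = float((S_uu * (y_u[:, None] - y_u[None, :]) ** 2).sum())   # formula (3)
    return F_l, F_u, F_l + C * F_u       # combined objective, formula (5)

# toy example: two labeled and three unlabeled points
S = np.ones((5, 5)) * 0.1 + np.eye(5) * 0.9
y_l = np.array([1, -1])
y_u = np.array([1, -1, 1])               # a candidate labeling of the unlabeled examples
print(inconsistencies(S, y_l, y_u, C=0.5))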

To calculate the confidence of unlabeled examples, formula (7) is modified to (8):

min F(S, y) = \sum_{i=1}^{n_u} (p_i - q_i)    (8)

Where

p_i = \sum_{j=1}^{n_l} S_{i,j} (h_i - y_j)^2 \delta(y_j, 1) + (C/2) \sum_{j=1}^{n_u} S_{i,j} (h_i - h_j)^2    (9)

q_i = \sum_{j=1}^{n_l} S_{i,j} (h_i - y_j)^2 \delta(y_j, -1) + (C/2) \sum_{j=1}^{n_u} S_{i,j} (h_i - h_j)^2    (10)

When x = y, \delta(x, y) = 1; otherwise \delta(x, y) = 0.

p_i and q_i are calculated through formulas (9) and (10), and represent the confidences of unlabeled example x_i belonging to the two different labels respectively. The label of an unlabeled example is computed as sign(p_i - q_i), and the confidence of this label is |p_i - q_i|.
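A minimal sketch of the label and confidence computation in formulas (9) and (10); it is an illustration rather than the authors' implementation, and the argument h, holding the current predictions of a classifier on every example, is an assumption about how the quantities are supplied.

import numpy as np

def label_and_confidence(S, y_l, h, C=1.0):
    # S: (n_l + n_u) x (n_l + n_u) similarity matrix, labeled examples first.
    # y_l: labels of the labeled examples in {-1, +1}.
    # h: current predicted labels for every example (labeled followed by unlabeled).
    n_l = len(y_l)
    h_u = h[n_l:]
    smooth = (C / 2.0) * (S[n_l:, n_l:] * (h_u[:, None] - h_u[None, :]) ** 2).sum(axis=1)
    lab_term = S[n_l:, :n_l] * (h_u[:, None] - y_l[None, :]) ** 2
    p = (lab_term * (y_l == 1)).sum(axis=1) + smooth      # formula (9)
    q = (lab_term * (y_l == -1)).sum(axis=1) + smooth     # formula (10)
    labels = np.sign(p - q)
    labels[labels == 0] = 1                               # break ties arbitrarily
    return labels, np.abs(p - q)                          # sign(p_i - q_i) and |p_i - q_i|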
IV. EXPERIMENTS AND ANALYSIS

Four UCI data sets are used in this experiment; detailed information on these data sets is tabulated in Table I. The data sets used in the comparative experiments include two credit card data sets, Australian and German, and two medical diagnostic data sets, breast-cancer and diabetes.

TABLE I. BASIC INFORMATION FOR THE DATA SETS

            Australian  German  Breast-cancer  Diabetes
size        690         1000    699            768
attribute   14          20      11             8
class       2           2       2              2

For each data set, about 25% of the data are kept as test examples while the rest are used as the pool of training examples. L and U are partitioned under different unlabeled rates, including 20%, 40%, 60% and 80%. For example, assuming a set contains 1000 examples, 250 examples are used as test examples and the remaining 750 examples are kept as training examples. When the unlabeled rate is 20%, 600 examples are put into L with their labels while the remaining 150 examples are put into U without their labels. The experiments compare the performance under the different percentages of training data.

The experiment includes two groups, taking BP neural networks and the ID3 decision tree as the base classifier respectively; the performance of the CEGSL algorithm is compared with two semi-supervised learning algorithms, i.e. Tri-training and Co-training.

[Figure 3. Average classification error rate comparison with BP neural network; panels (a)-(d) plot the classification error rate of Co-training, Tri-training and CEGSL against the number of iterations under 80%, 60%, 40% and 20% unlabel rates.]

[Figure 4. Average classification error rate comparison with ID3 decision tree; panels (a)-(d) plot the classification error rate of Co-training, Tri-training and CEGSL against the number of iterations under 80%, 60%, 40% and 20% unlabel rates.]

Figure 3 and Figure 4 give the plots of the average classification error rates versus the learning iterations before the algorithm stops; the error rates of the compared algorithms are also depicted in these figures. The semi-supervised learning algorithms and the three single classifiers with BP neural network and ID3 decision tree are trained from only the labeled training examples, i.e. L. The average error rate of the single classifiers is shown as a vertical line in each figure, and the iteration number of each algorithm is shown as a horizontal line.

In detail, Figure 3(a)~(d) show the average classification error rate during the iterative process when BP neural networks are used on all data sets. From the results, we can see that Tri-training can effectively reduce the classification error only in the first two or three rounds; with further iterations, the classification error rate increases considerably. Since there is no effective way to prevent the introduction of noisy data, the noise continues to accumulate during the iterations of the algorithm. Therefore, it has a negative impact on Tri-training, especially in the case of few labeled examples [16]. Moreover, when Co-training is used, the introduction of noisy data can be prevented to a certain extent by using 10-fold cross-validation. Figure 3 reveals that in all the subfigures, the final hypotheses generated by CEGSL are better than the initial hypotheses. Compared with the other two algorithms, the final hypotheses of CEGSL are almost always better after the first two or three iterations.

When the ID3 decision tree is used, Figure 4(a)~(d) also show the average classification error rate during the iterative process. It can be observed from the figures that the line of CEGSL is always below those of the other compared algorithms after the first two or three rounds. Moreover, the error rate of CEGSL keeps decreasing when utilizing more unlabeled examples, and converges quickly within just a few learning iterations. From subfigure (a) to (d), CEGSL remains comparable with all the classifiers under all the unlabel rates.

From Figure 3 and Figure 4, we can see that in all the subfigures the final hypotheses generated by CEGSL are better than the initial hypotheses. This confirms that CEGSL can effectively exploit unlabeled examples to enhance the learning performance.

The comparative results are also summarized in Table II to Table V, which present the classification error rates of the initial and final hypotheses generated by CEGSL, Tri-training and Co-training, and the improvement of the latter over the former, under the 80%, 60%, 40% and 20% unlabel rates. The biggest improvements achieved by each algorithm have been boldfaced in the tables.

Tables II to V show that the CEGSL algorithm can effectively improve the hypotheses with BP neural network and ID3 decision tree under all the unlabel rates. In fact, if the improvements are averaged across all the data sets, classifiers and unlabel rates, it can be found that the average improvement of CEGSL is 5.33% with BP neural network and 4.65% with ID3 decision tree. It is impressive that with all the classifiers and under all the

unlabel rates, CEGSL achieved the biggest average improvement. Moreover, Tables II to V also show that if the algorithms are compared by counting the number of winning data sets, CEGSL is almost always the winner.

In detail, under the 80% unlabel rate, CEGSL has 4 winning data sets when the BP neural network is used; when the ID3 decision tree is used, CEGSL has 3 winning data sets while Co-training has 1. Under the 60% unlabel rate, when the BP neural network and the ID3 decision tree are used, CEGSL has 3 winning data sets in each case while Tri-training has 1. Under the 40% unlabel rate, CEGSL has 4 winning data sets when the BP neural network is used; when the ID3 decision tree is used, CEGSL has 3 winning data sets while Co-training has 1. Under the 20% unlabel rate, when the BP neural network is used, CEGSL has only 2 winning data sets and Co-training also has 2; when the ID3 decision tree is used, CEGSL has 3 winning data sets and Co-training has 1.

TABLE II. THE CLASSIFICATION ERROR RATES OF THE INITIAL AND FINAL HYPOTHESES AND THE CORRESPONDING IMPROVEMENTS OF CEGSL,
TRI-TRAINING AND CO-TRAINING UNDER 80% UNLABEL RATE

BP
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 18.62 13.72 4.9 16.83 15.27 1.56 17.26 14.28 2.98
German 22.16 17.28 4.9 17.92 16.22 1.7 19.37 16.33 3.04
Breast-cancer 19.27 12.22 7.1 18.27 16.38 1.89 18.22 14.57 3.65
Diabetes 14.26 9.77 4.5 13.21 10.37 2.84 15.37 11.26 4.11
average 18.58 13.25 5.33 16.56 14.56 2.00 17.56 14.11 3.45
ID3
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 17.25 13.72 3.5 17.35 14.37 2.98 18.26 14.26 4
German 20.01 14.08 5.9 18.33 15.33 3 16.27 14.23 2.04
Breast-cancer 18.97 12.36 6.6 19.21 17.39 1.82 17.35 14.27 3.08
Diabetes 12.31 9.77 2.5 12.67 10.27 2.4 12.36 11.75 0.61
average 17.14 12.48 4.65 16.89 14.34 2.55 16.06 13.62 2.43

TABLE III. THE CLASSIFICATION ERROR RATES OF THE INITIAL AND FINAL HYPOTHESES AND THE CORRESPONDING IMPROVEMENTS OF CEGSL,
TRI-TRAINING AND CO-TRAINING UNDER 60% UNLABEL RATE

BP
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 15.23 9.27 6 14.53 13.37 1.16 16.25 12.97 3.28
German 17.16 11.39 5.8 20.55 18.66 1.89 19.33 17.66 1.67
Breast-cancer 15.79 12.63 3.2 15.79 13.28 2.51 16.76 15.32 1.44
Diabetes 10.33 7.95 2.4 11.27 8.25 3.02 11.37 9.26 2.11
average 14.63 10.31 4.32 15.54 13.39 2.15 15.93 13.80 2.13
ID3
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 15.33 10.33 5 15.07 14.09 0.98 15.37 13.98 1.39

German 16.89 12.67 4.2 21.97 17.38 4.59 18.39 16.27 2.12

Breast-cancer 17.21 10.27 6.9 16.33 12.33 4 19.25 14.33 4.92

Diabetes 10.31 7.95 2.4 10.78 9.72 1.06 12.36 10.97 1.39

average 14.94 10.31 4.63 16.04 13.38 2.66 16.34 13.89 2.46

TABLE IV. THE CLASSIFICATION ERROR RATES OF THE INITIAL AND FINAL HYPOTHESES AND THE CORRESPONDING IMPROVEMENTS OF CEGSL,
TRI-TRAINING AND CO-TRAINING UNDER 40% UNLABEL RATE

BP
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 12.53 9.28 3.3 11.27 10.28 0.99 12.76 11.27 1.49
German 16.79 11.98 4.8 18.25 15.26 2.99 17.62 16.27 1.35
Breast-cancer 14.28 9.63 4.7 14.38 13.72 0.66 15.73 12.37 3.36
Diabetes 10.03 6.72 3.3 9.26 7.05 2.21 9.68 8.29 1.39
average 13.41 9.40 4.01 13.29 11.58 1.71 13.95 12.05 1.90
ID3
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 12.62 9.87 2.8 10.73 8.37 2.36 12.33 9.29 3.04
German 14.38 10.26 4.1 17.28 15.38 1.9 17.95 14.27 3.68
Breast-cancer 15.35 11.29 4.1 15.79 14.27 1.52 15.28 14.39 0.89
Diabetes 9.27 6.79 2.5 10.32 8.27 2.05 11.37 9.37 2
average 12.91 9.55 3.35 13.53 11.57 1.96 14.23 11.83 2.4

TABLE V. THE CLASSIFICATION ERROR RATES OF THE INITIAL AND FINAL HYPOTHESES AND THE CORRESPONDING IMPROVEMENTS OF CEGSL,
TRI-TRAINING AND CO-TRAINING UNDER 20% UNLABEL RATE

BP
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 12.39 10.27 2.1 13.76 12.53 1.23 11.92 10.22 1.7
German 13.05 9.38 3.7 14.32 12.09 2.23 14.38 12.75 1.63
Breast-cancer 11.76 10.34 1.4 12.68 11.06 1.62 12.25 10.29 1.96
Diabetes 9.26 7.28 2 9.59 9.07 0.52 10.39 8.17 2.22
average 11.62 9.32 2.30 12.59 11.19 1.40 12.24 10.36 1.88
ID3
Data set CEGSL Tri-training Co-training
initial final improv initial final improv initial final improv
Australian 10.75 7.39 3.4 12.97 11.29 1.68 11.25 10.27 0.98
German 13.27 10.28 3 13.28 12.05 1.23 12.33 11.79 0.54
Breast-cancer 11.82 9.25 2.6 12.77 10.75 2.02 13.59 11.52 2.07
Diabetes 7.25 6.27 1 8.95 8.03 0.92 9.68 7.89 1.79
average 10.77 8.30 2.48 12 10.53 1.46 11.71 10.37 1.35

In Table II, under the 80% unlabeled rate, the average corresponding improvement of the CEGSL algorithm is 5.33% when the BP neural network is used, which is better than Tri-training (2.0%) and Co-training (3.45%). Similarly, when the ID3 decision tree is used as the classifier, the CEGSL algorithm not only has a higher corresponding improvement on German, Breast-cancer and Diabetes than Tri-training and Co-training, but its final error rate (12.48%) is also better than those of Tri-training (14.34%) and Co-training (13.62%).

In Table III, under the 60% unlabeled rate, the average corresponding improvement of the CEGSL algorithm is 4.32% when the BP neural network is used; the improvement of the average error is higher than Tri-training (2.15%) and Co-training (2.13%).

In Table IV, under the 40% unlabeled rate, the classifiers get enough labeled data for learning and become stronger. Therefore, the initial error rates of the three algorithms decrease, which causes the improvement in classification precision to become smaller. CEGSL, Tri-training and Co-training only get

improvements of 3.35%, 1.96% and 2.4% respectively. Under these circumstances, CEGSL still shows better performance.

In Table V, under the 20% unlabeled rate, the original labeled data can already train a strong classifier and the contribution of the unlabeled data decreases. The improvements of CEGSL, Tri-training and Co-training only reach 2.48%, 1.46% and 1.35% respectively. CEGSL still shows the greater improvement.

V. CONCLUSIONS

In this paper, the CEGSL algorithm is proposed. This algorithm combines graph-based semi-supervised learning and collaborative-training algorithms. It makes use of the structure information of the sample data to calculate the classification probability of unlabeled examples explicitly. The algorithm achieves good efficiency and generalization ability because it can effectively select sample data to label and use multiple classifiers to form the final hypothesis. Experiments on UCI data sets demonstrate the effectiveness of this algorithm. The determination of the classification error rate in CEGSL is worth studying in future work. Its applicability is wide because it does not require sufficient and redundant views. Moreover, using statistical techniques to further identify and deal with noisy data can be researched in the future.

ACKNOWLEDGMENT

This work was sponsored by the Visual Computing and Virtual Reality Key Laboratory of Sichuan Province in China (No. PJ201102).

REFERENCES

[1] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, T. S. Huang. Semi-supervised learning of classifiers: Theory, algorithm, and their application to human-computer interaction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(12): 1553-1567.
[2] Z.-H. Zhou. Learning with unlabeled data and its application to image retrieval[C]. In: Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence, 2006.
[3] Jimin Li. A Novel Semi-supervised SVM Based on Tri-training for Intrusion Detection[J]. Journal of Computers, 2010, 5(4).
[4] Kurt Driessens, Peter Reutemann. Using Weighted Nearest Neighbor to Benefit from Unlabeled Data[M]. Encyclopedia of Machine Learning, 2010: 857-862.
[5] A. Blum, T. Mitchell. Combining labeled and unlabeled example with co-training[C]. In: Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT98), Wisconsin, MI, 1998, pp: 92-100.
[6] Zhu X. Semi-supervised learning literature survey[R]. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, Jul. 2008.
[7] D. Zhou, O. Bousquet, T. Lal, et al. Learning with local and global consistency[C]. Advances in Neural Information Processing Systems (NIPS), Cambridge, MA: MIT Press, 2004, 16, pp: 321-328.
[8] S. Goldman, Y. Zhou. Enhancing supervised learning with unlabeled example[C]. In: Proceedings of the 17th International Conference on Machine Learning (ICML00), San Francisco, CA, 2000, pp: 327-334.
[9] X.J. Zhu, Z. Ghahramani. Semi-supervised learning using Gaussian fields and harmonic functions[C]. In: Proceedings of the International Conference on Machine Learning (ICML03), Washington DC, 2003, pp: 912-919.
[10] Zhou Z H, Li M. Tri-training: Exploiting unlabeled example using three classifiers[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529-1541.
[11] Tao Guo, Guiyang Li. Improved Tri-Training with Unlabeled example[C]. International Conference on Nanotechnology and Computer Engineering (CNCE), 2011.
[12] Yongzhao Zhan, Yabi Cheng. Co-Training Semi-Supervised Active Learning Algorithm with Noise Filter[J]. Pattern Recognition and Artificial Intelligence, 2009, 22(5): 750-755.
[13] Zhu X J. Semi-supervised learning with graphs[D]. USA: Carnegie Mellon University, 2006: 1-89.
[14] Blum A, Chawla S. Learning from labeled and unlabeled example using graph mincuts[C]. In: Proceedings of the 18th International Conference on Machine Learning, 2001.
[15] Pavan K, Rong J. SemiBoost: Boosting for Semi-Supervised Learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(11): 2000-2014.
[16] Zhou Z H, Wang Jue. Machine learning and application[M]. Beijing: Tsinghua University Press, 2007, pp: 259-275.

Tao Guo received the M.S. degree in computer science and computer engineering from the University of Arkansas, USA, in 2001. She is an associate professor in the College of Computer Science, Sichuan Normal University, China. Her current areas of interest include data mining and bioinformatics. Email: tguo35@gmail.com

Guiyang Li received the Ph.D. degree in computer science from Sichuan University, China, in 2009. He is an associate professor in the College of Computer Science, Sichuan Normal University, China. He is actively involved in the development of network security. His current areas of interest include artificial immune computation and network security. Email: guiyang.li@gmail.com

Semantically Enhanced Uyghur Information Retrieval Model

Bo Ma
Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumuqi, China
Email: yanyushu@gmail.com

Yating Yang and Xi Zhou
Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumuqi, China
Email: {yangyt, zhouxi}@ms.xjb.ac.cn

Junlin Zhou
Xinjiang Branch of Chinese Academy of Sciences, Urumuqi, China
Email: zhoujl@ms.xjb.ac.cn

Abstract - Traditional Uyghur search engines lack semantic information. Aiming to solve this problem, a semantically enhanced Uyghur information retrieval model is proposed based on the characteristics of the Uyghur language. Firstly, word stemming is carried out and web pages are represented in the form of triples to construct the Uyghur knowledge base; then the matching between ontologies and web pages is established by computing concept similarity and relation similarity. A semantic inverted index is built to save the association between semantic entities and web pages, and user query analysis is implemented by expanding the queries and analyzing the relations between them; finally, a ranking algorithm is implemented by combining the benefits of both keyword-based and semantic-based methods. By comparing with the Google search engine and a Lucene-based method, the experiments preliminarily validate the effectiveness and feasibility of the model.

Index Terms - Uyghur, ontology, semantic search, semantic relation, information retrieval

I. INTRODUCTION

Along with the development of technologies and the enrichment of the resources of the Internet, the WWW has become a dynamic and huge information service network. Although traditional search engines are convenient, they have the problems of low precision and recall, which are caused by the lack of semantic information in keyword matching technology and the misunderstanding of users' intentions. In order to provide better service, the major search engines such as Google, Bing, etc. offer semantic search as a supplement to their search service. In recent years, semantic search has received much attention; it introduces semantic web technologies into the traditional search engine, combining ontology concepts to annotate and match web resources against users' queries so as to improve search performance and construct the next generation of search engines [1].

Semantic search technologies can be divided into three categories: statistics-based metrics, linguistics-based metrics and ontology-based metrics. Latent Semantic Analysis (LSA) is a statistics-based method, which uses algebraic methods to analyze the latent relations between a set of documents and terms [2]; linguistics-based metrics refine the concepts in document sets by using a thesaurus, and one of the most used methods is describing concepts with WordNet [3]; ontology-based metrics first construct a high-quality ontology and knowledge base, and then use them to annotate documents and execute the mapping between concepts and user queries, for example KIM, TAP and Hakia [4,5,6,7]. With the development of the semantic web, a large amount of structured open metadata has emerged, such as Freebase, Linked Data, Apex and YAGO, and researchers have begun to use these metadata to build semantic search frameworks; for example, PowerSet uses Freebase to annotate Wikipedia. To our knowledge, there are no semantic search prototypes for minority languages so far. In this paper, a semantically enhanced Uyghur information retrieval model is proposed, its key technologies are presented, and experiments are carried out to validate the effectiveness of the model.

The remainder of this paper is organized as follows: Section 2 presents the characteristics of Uyghur and the stemming algorithm for this language. In Section 3, the semantic retrieval model is proposed and the key technologies are discussed. Section 4 uses experiments to validate the effectiveness of this model. Finally, we conclude the paper in Section 5; possible extensions of the proposed model are also mentioned in this section.

Manuscript received June 23, 2011; revised November 2, 2011; accepted December 31, 2011. This work was supported in part by the Science and Technology Projects of Xinjiang Uyghur Autonomous Region under Grant 201012112. Corresponding author: Ma Bo.

II. UYGHUR KNOWLEDGE REPRESENTATION

A. Uyghur Stemming Algorithm

Uyghur is an official language of the Xinjiang Uyghur Autonomous Region and is spoken by the Uyghur people. The Uyghur language belongs to the Uyghur Turkic branch of the Turkic language family, which is controversially considered a branch of the Altaic language family. It has an alphabet of 32 letters and more than 120 character forms. Uyghur is an agglutinative language in which the word is the smallest independent unit [8]. A word is composed of a stem and suffixes. For example, when a user inputs a word meaning "development of China", the search engine should separate the word into its stem and suffix, and should then return web pages that include the stems "China" and "development" as well as other relevant pages.

According to linguistic theory, morphemes are the smallest meaning-bearing units of language as well as the smallest units of syntax [9]. Mathias Creutz and Krista Lagus have proposed a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data.

The model of language (M) consists of a morph vocabulary and a grammar. The task is to compute the maximum a posteriori (MAP) estimate of the parameters [10]:

arg max_M P(M | corpus) = arg max_M P(corpus | M) P(M)    (1)

The MAP estimate consists of two parts: the probability of the model of language, P(M), and the maximum likelihood (ML) estimate of the corpus conditioned on the given model of language, written as P(corpus | M).

P(M) = \prod_{i=1}^{N} [P(form(\mu_i)) P(usage(\mu_i))]    (2)

The probability of a morph is divided into two parts, the form and the usage of the morph. The probability P(usage(\mu_i)) is derived from the prior probability of the occurrence frequency and the length of \mu_i, and P(form(\mu_i)) is computed as follows:

P(form(\mu_i)) = \prod_{j=1}^{length(\mu_i)} P(c_{ij})    (3)

Where P(c_{ij}) represents the probability of the jth letter of the ith morph in the lexicon.

P(corpus | M) = \prod_{j=1}^{W} \prod_{k=1}^{n_j} P(\mu_{jk})    (4)

W is the number of words in the corpus, where each word can be represented by a morph sequence; if a word can be segmented into n_j morphs, P(\mu_{jk}) means the probability of the kth morph of the jth word.

We first use this morphological segmentation method to extract the morphemes of Uyghur; the morphemes are then compared with the stems and suffixes we have collected, the confirmed stems are used for Uyghur stemming, and unknown morphemes are further processed by linguistic experts. We have collected more than 25,000 Uyghur stems, which are capable of handling most of the Uyghur word segmentation task, and vowel harmony is processed by rule-based approaches; detailed information can be found in [1].
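As a toy illustration of formulas (3) and (4), the following sketch (not part of the original system) scores a candidate segmentation of a small corpus under letter and morph probabilities; the corpus, the morph lexicon and the probability estimates are assumptions made only for the example.

import math
from collections import Counter

def p_form(morph, letter_prob):
    # formula (3): product of the letter probabilities of the morph
    return math.prod(letter_prob[c] for c in morph)

def p_corpus(segmented_corpus, morph_prob):
    # formula (4): product over words and over the morphs of each word
    return math.prod(morph_prob[m] for word in segmented_corpus for m in word)

# toy segmented corpus: each word is a list of morphs (stem plus suffix)
corpus = [["kitab", "lar"], ["kitab"], ["oqu", "di"]]
morph_counts = Counter(m for word in corpus for m in word)
total = sum(morph_counts.values())
morph_prob = {m: c / total for m, c in morph_counts.items()}   # ML morph probabilities
letters = Counter(c for m in morph_counts for c in m * morph_counts[m])
letter_total = sum(letters.values())
letter_prob = {c: n / letter_total for c, n in letters.items()}

print(p_form("kitab", letter_prob))      # P(form) of one morph
print(p_corpus(corpus, morph_prob))      # corpus likelihood under this segmentation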
B. Uyghur Knowledge Base Construction

An ontology is a "formal, explicit specification of a shared conceptualization" [11]. It is a formal representation of knowledge as a set of concepts within a domain and the relations between those concepts. It is used to reason about the properties of that domain, and may be used to describe the domain.

In recent years, many open ontologies have emerged; we do not aim to create new ontologies, but to reuse the existing ones. The crawled pages are processed into structured data and stored as ontology resources; the processing steps are as follows:

1) Give every page a unique URI (Uniform Resource Identifier). For example, for the Uyghur page http://uyghur.people.com.cn/155989/15153298.html, because the URL is unique, we use http://uyghur.people.com.cn/155989/15153298 as the URI of this page.

2) Define six properties for each page: label, tag, content, link (used to store the URL), pagelink (used to store the hyperlinks in the page), and relatedlink (used to store the related links).

3) Store the URIs and properties in N-TRIPLE format; six N-TRIPLE files were built.

After format conversion, the mapping between contents and ontology concepts is executed. We have collected ontologies covering sports, finance, entertainment, news, etc. from Swoogle and Google; the structured data along with the ontologies constitute our knowledge base.
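A minimal sketch of steps 1)-3) above, writing one page out as N-Triples. It is an illustration rather than the authors' code, and the property namespace http://example.org/prop/ is an assumption, since the paper does not specify one.

def page_to_ntriples(url, properties):
    # Step 1): derive the URI from the unique page URL.
    uri = url.rsplit(".html", 1)[0]
    lines = []
    for prop, value in properties.items():
        # Steps 2)-3): one triple per property, serialized as an N-Triple line.
        escaped = value.replace('"', '\\"')
        lines.append(f'<{uri}> <http://example.org/prop/{prop}> "{escaped}" .')
    return "\n".join(lines)

page = {
    "label": "Example page title",
    "tag": "news",
    "content": "Body text of the page ...",
    "link": "http://uyghur.people.com.cn/155989/15153298.html",
    "pagelink": "http://uyghur.people.com.cn/other/article.html",
    "relatedlink": "http://uyghur.people.com.cn/related/article.html",
}
print(page_to_ntriples("http://uyghur.people.com.cn/155989/15153298.html", page))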

C. Mapping between Documents and Ontologies

Mapping between documents and ontologies is in essence the selection of ontology concepts. We implement the mapping by computing the similarity between documents and concepts and the similarity between the concepts themselves:

Sim(W, O, r) = \omega_1 Sim_c(W, O) + \omega_2 Sim_r(W, O, r)    (5)

Where \omega_1 + \omega_2 = 1, W is the word set of a document, O is a specific ontology matching the document, and r = {r, w_i, w_j}, w_i, w_j \in W, denotes the relation between the words in a document. Sim_c(W, O) represents the similarity between W and O, and Sim_r(W, O, r) represents the similarity between concepts.

The concept similarity Sim_c(W, O) is defined as follows: when a word w_i matches the name or the content of the rdfs:label property of a concept c_j, we call it an exact match; when the stem of w_i matches the name or the content of the rdfs:label of concept c_j, we call it a partial match. Sim_c(W, O) is computed as follows:

Sim_c(W, O) = \sum_{i=1}^{m} \sum_{j=1}^{n} Sim_c(w_i, c_j)    (6)

Where m is the number of words in the document, n is the number of matched concepts in the ontologies, and Sim_c(w_i, c_j) is computed as follows:

Sim_c(w_i, c_j) = 1 if w_i and c_j match exactly; = 0.5 if w_i and c_j match partially; = 0 otherwise    (7)

For Sim_r(W, O, r), computing the similarity between words can be transformed into computing the similarity between the corresponding concepts [12]:

Sim_r(W, O, r) = \sum_{i,j} sim(c_i, c_j, r)    (8)

sim(c_i, c_j, r) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}    (9)

Where l is the shortest path between c_i and c_j, h is the depth of the least common subsumer of c_i and c_j, \delta_{ij} is the number of paths between c_i and c_j, and \alpha and \beta are constants which control the impact of l and h on the similarity. The normalized result of the above formula is:

Sim'(W, O, r) = Sim(W, O, r) / max{Sim(W, O, r)}    (10)

Figure 1 shows the mapping of ontologies and instances (to help understanding, we use English to show the information in the graph; in the actual system the language is Uyghur). The top half of Figure 1 shows the internal structure of the ontology, and the bottom half includes two instances of the ontology and shows the mapping between the ontology and the instances.

[Figure 1. Ontology-instance mapping graph.]
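The following sketch illustrates the word-concept similarity of formulas (6), (7) and (9); it is not the paper's implementation, the stemming function is a stand-in, and the values of alpha and beta are assumptions.

import math

def sim_c(word, concept_label, stem):
    # formula (7): 1 for an exact match, 0.5 when only the stem matches, 0 otherwise
    if word == concept_label:
        return 1.0
    if stem(word) == concept_label:
        return 0.5
    return 0.0

def concept_similarity(words, concept_labels, stem):
    # formula (6): sum over document words and matched ontology concepts
    return sum(sim_c(w, c, stem) for w in words for c in concept_labels)

def relation_similarity(l, h, alpha=0.2, beta=0.6):
    # formula (9): shortest path length l, depth h of the least common subsumer
    return math.exp(-alpha * l) * math.tanh(beta * h)

stem = lambda w: w.rstrip("s")            # stand-in stemmer for the example
words = ["developments", "china"]
labels = ["development", "china"]
print(concept_similarity(words, labels, stem))   # 0.5 + 1.0 = 1.5
print(relation_similarity(l=2, h=3))

Note that math.tanh(beta * h) is exactly the quotient (e^{beta h} - e^{-beta h}) / (e^{beta h} + e^{-beta h}) used in formula (9).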
d of its own, and C d means the weight of d in its context.

III. SEMANTICALLY ENHANCED RETRIEVAL MODEL

The semantically enhanced retrieval model includes four modules: resource collection, semantic annotation, query analysis and result ranking. The resource collection module uses a web crawler to download relevant web pages; the semantic annotation module annotates the crawled pages and establishes the semantic index; the query analysis module analyzes the semantic relevance by matching the user's query against ontology concepts; and the result ranking module ranks the search results based on the content matching and the semantic relevance between the user input and the returned pages. The architecture is shown in Figure 2.

[Figure 2. The semantically enhanced retrieval model.]

A. Semantic Indexing

The inverted index is the most used index structure in modern search engines. In this paper, we choose it as the basic index structure, and the semantic index is established by combining the keyword-based index with semantic annotation. The steps are as follows:

1) Establish an inverted index for the web documents: establish the mapping between the words of the web documents and the concepts of the ontologies, and index the words along with the corresponding concepts according to the discussion in Section II.C.

2) Establish an inverted index for the semantic entities in the knowledge base: extract the textual representation from the rdfs:label property of the entity; the textual representations are then searched in the document index, and the retrieved documents are tagged as the potential document set A.

3) Extract the context of the semantic entity: the ontological relations are exploited to extract its semantic context; the textual representations from the rdfs:label properties of its directly linked entities are extracted and searched in the document index, and the results are tagged as the potential document set B.

4) Compute the intersection of set A and set B; the result is tagged as set C, which is the document set corresponding to the semantic entity.

5) Weight the annotations: the weights of the semantic entities are computed as follows [13]:

\alpha S_d + (1 - \alpha) C_d    (11)

Where 0 < \alpha < 1, S_d is the weight of document d on its own, and C_d is the weight of d in its context.

The establishment of the semantic index is shown in Figure 3.

[Figure 3. Semantic annotation based on context information: the textual representations of a semantic entity and of its context are extracted from rdfs:label, searched in the inverted document index to obtain document sets A and B, and the intersection of A and B yields the documents to annotate, together with a weight for each entity-document pair.]
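A small sketch of steps 2)-5): document sets A and B are retrieved through an assumed keyword index, intersected, and each annotation is weighted with formula (11). The index structure and the weight inputs are illustrative assumptions, not the system's actual data.

def annotate_entity(entity_label, context_labels, inverted_index, s_d, c_d, alpha=0.6):
    # Step 2): documents that mention the entity's own label
    set_a = set(inverted_index.get(entity_label, []))
    # Step 3): documents that mention the labels of directly linked entities
    set_b = set()
    for label in context_labels:
        set_b.update(inverted_index.get(label, []))
    # Step 4): the documents actually annotated with the entity
    set_c = set_a & set_b
    # Step 5): weight of the annotation for each document, formula (11)
    return {d: alpha * s_d.get(d, 0.0) + (1 - alpha) * c_d.get(d, 0.0) for d in set_c}

inverted_index = {"Tiananmen Square": ["D1", "D2"], "Beijing": ["D2", "D3"], "hotel": ["D2"]}
weights = annotate_entity("Tiananmen Square", ["Beijing", "hotel"], inverted_index,
                          s_d={"D2": 0.8}, c_d={"D2": 0.5})
print(weights)   # {'D2': 0.68}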


B. Query Analysis

The task of query analysis is to establish the mapping between the query and the ontology knowledge base. A given word generally has synonyms, hypernyms, hyponyms and so on, which we call its lexical relations. Besides lexical relations, there are also semantic relations between words, which describe the connections between concepts. In this paper, we present an approach which combines lexical relations and semantic relations to analyze the user's query:

1) Accept the user's input, extract the stems and save them in a set N = {n_1, n_2, ..., n_k}, 1 <= i <= k, where k is the number of stems.

2) Analyze the lexical relations with a synset: for each stem n_i (1 <= i <= k), acquire its extensions and save them as a set S_i = {S_{i1}, S_{i2}, ..., S_{im}}, m = |S_i|. R_{ij} is the lexical relevance between S_{ij} and n_i, 0 <= R_{ij} <= 1.

3) Map the stems in S_i to the ontology library and save the corresponding ontology concepts in the set T_i = {T_{i1}, T_{i2}, ..., T_{if}}; we then get k such sets. R'_{ij} is the lexical relevance between T_{ij} and n_i; assuming T_{ij} is the extension of S_{ik} (1 <= k <= m), then R'_{ij} is equal to R_{ik}.

4) Given the set W = {(\omega_1, \omega_2, ..., \omega_k) | (\omega_i \in T_i, T_i \neq \emptyset, 1 <= i <= k) \vee (\omega_i = \emptyset, T_i = \emptyset, 1 <= i <= k)}, compute the value of SR_n, which is then used for ranking each w_n (w_n \in W):

SR_n = \alpha \sum_{i,j=1}^{k} sr_{ij} + \beta \sum_{i=1}^{k} r_i    (12)

Where sr_{ij} is the semantic relevance between \omega_i and \omega_j, r_i is the lexical relevance of \omega_i, and \alpha and \beta are the weights of the semantic relevance and the lexical relevance (0 <= \alpha, \beta <= 1, \alpha + \beta = 1). The computation is as follows. Given two instances \omega_i, \omega_j \in w_n, \omega_i \neq \omega_j, if they can reach each other within a limited number of steps by breadth-first search, we consider them relevant; their semantic relevance is related to the length of the minimum path, the shorter the path, the higher the relevance, and vice versa. If \omega_i or \omega_j is equal to \emptyset, or they are not relevant, then sr_{ij} = sr_{min}.

5) Sort the elements of W according to the value of SR_n; the smaller the value, the more likely it is to meet the user's query.

6) Search the elements of W in the semantic index; the retrieved results are then ranked by the result ranking module.

C. Ranking Algorithm

Page ranking is one of the key technologies of a search engine, because the quality of the returned results directly influences the user's experience. When keywords are input, the best results not only contain the keywords but also consider the relations between them. For example, when a user inputs the keywords "Beijing", "hotel", "Tiananmen Square", the user may want to find information about hotels around Tiananmen Square, not isolated information about the keywords. Because the semantic annotation we built has already considered the relations between concepts, the retrieved results can be better organized by the ranking algorithm. Because the knowledge base we built cannot cover all the concepts in the document set, we use TF/IDF to evaluate the uncovered concepts as a complement.

In this paper, we propose a modulative method that ranks results based on how predictable a result might be for users, which is a combination of semantic and information-theoretic techniques.

First we calculate the relevance between keywords and page content:

I(t) = tf(t in d) \cdot idf(t)^2 \cdot boost(t.field in d) \cdot lengthNorm(t.field in d)    (13)

Where tf(t in d) represents the frequency of term t in document d, and idf(t) represents the inverse document frequency, which is a measure of the general importance of the term. boost(t.field in d) represents the boost factor of each field set when the index is established, and lengthNorm(t.field in d) represents the length factor of each field.

Then the similarity between the user's query and the documents is calculated. We define K = {k_1, k_2, ..., k_m}, k_i, k_j \in K, as the extension of the query and C = {c_1, c_2, ..., c_n}, c_i, c_j \in C, as the corresponding concepts of K; \phi_{ij} represents the number of relations between c_i and c_j in the knowledge base, and \psi_{ij} represents the number of relations between c_i and c_j in the document context. According to probability theory, the probability that a particular document interests the user can be calculated by P(q, d) = \psi_{ij} / \phi_{ij} (1 <= i, j <= n, i \neq j), and with l representing the length of a path between concepts, the similarity between the query and the concepts is:

SemMatch(q, d) = P(q, d) \cdot (2l)^{-1}    (14)

Now we add a search mode \lambda ranging from 0 to 1, with 0 indicating a purely conventional mode and 1 indicating a purely semantic mode. Based on this, we build the modulative ranking model shown below:

SemRank = (1 - \lambda) I(t) (1 + \lambda SemMatch(q, d))    (15)
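A compact sketch of the modulative ranking of formulas (13)-(15); the keyword score I(t) here is a simplified tf-idf stand-in rather than the full field-boosted formula, and the example numbers are assumptions.

import math

def keyword_score(tf, df, n_docs):
    # simplified stand-in for formula (13): term frequency times squared idf
    idf = math.log(n_docs / (1 + df))
    return tf * idf ** 2

def sem_match(psi, phi, path_len):
    # formula (14): relation overlap scaled by the path length between concepts
    return (psi / phi) / (2 * path_len) if phi else 0.0

def sem_rank(i_t, sem, lam=0.5):
    # formula (15): blend of conventional and semantic evidence, controlled by lambda
    return (1 - lam) * i_t * (1 + lam * sem)

i_t = keyword_score(tf=3, df=20, n_docs=10000)
sem = sem_match(psi=2, phi=5, path_len=1)
for lam in (0.0, 0.5, 1.0):
    print(lam, sem_rank(i_t, sem, lam))   # scores under different search modes lambda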

IV. EXPERIMENTAL RESULTS

10 GB of Uyghur web pages were crawled from the Internet and then preprocessed into N-triples; two indices were built, a traditional index and a semantic index. Ten commonly used user queries were chosen to construct the input set, and their average length was 4 words. The Uyghur version of Google and Lucene were chosen as the benchmarks, and precision, recall, Mean Average Precision (MAP) and Precision at 10 (P@10) were chosen as the evaluation metrics.

precision = TP / (TP + FP)    (16)

recall = TP / (TP + FN)    (17)

MAP = \frac{\sum_{r=1}^{N} (P(r) \cdot rel(r))}{N}    (18)

Where N represents the number of retrieved documents, TP is the number of returned relevant pages, FP is the number of returned irrelevant pages, and FN is the number of relevant pages that were not returned; r represents the position of a particular page in the retrieved results, P(r) represents the precision of the top r retrieved results, and rel(r) is a binary function which determines whether the rth page in the retrieved documents is relevant. The experimental results are shown as follows.

TABLE I. COMPARISON OF DIFFERENT SYSTEMS: P@10

Query   Semantic Search   Lucene   Google
1       0.7               0.5      0.6
2       0.5               0.3      0.5
3       0.3               0.3      0.4
4       0.6               0.5      0.5
5       0.6               0.4      0.7
6       0.4               0.2      0.2
7       0.8               0.5      0.6
8       0.1               0.2      0.1
9       0.2               0.2      0.3
10      0.35              0.3      0.4
Mean    0.47              0.34     0.43

TABLE II. COMPARISON OF DIFFERENT SYSTEMS: PRECISION

Query   Semantic Search   Lucene   Google
1       0.74              0.58     0.66
2       0.58              0.61     0.68
3       0.42              0.46     0.51
4       0.67              0.67     0.54
5       0.56              0.61     0.47
6       0.36              0.29     0.44
7       0.67              0.52     0.56
8       0.29              0.32     0.33
9       0.34              0.38     0.41
10      0.48              0.36     0.37
Mean    0.51              0.48     0.50

TABLE III. COMPARISON OF DIFFERENT SYSTEMS: RECALL

Query   Semantic Search   Lucene   Google
1       0.66              0.61     0.70
2       0.48              0.63     0.67
3       0.45              0.48     0.54
4       0.59              0.66     0.59
5       0.52              0.59     0.45
6       0.35              0.38     0.48
7       0.61              0.54     0.61
8       0.22              0.33     0.32
9       0.35              0.34     0.43
10      0.51              0.43     0.39
Mean    0.47              0.50     0.52

TABLE IV. COMPARISON OF DIFFERENT SYSTEMS: MAP

Query   Semantic Search   Lucene   Google
1       0.46              0.41     0.44
2       0.28              0.29     0.31
3       0.23              0.24     0.25
4       0.37              0.34     0.33
5       0.24              0.27     0.22
6       0.22              0.18     0.23
7       0.33              0.27     0.28
8       0.18              0.19     0.19
9       0.22              0.23     0.26
10      0.33              0.24     0.30
Mean    0.29              0.27     0.28

From the experimental results, we can see that there is no big difference in precision and MAP among the three systems, but the result of the semantic search on P@10 outperforms the other two systems, which means the pages retrieved by our proposed model can better meet users' needs.

ACKNOWLEDGMENT

Our thanks to Xi Zhou, Lei Wang, Turghun Ousiman, and all the members of the Research Center for Multilingual Information Technology. This work is funded by the Science and Technology Project of Xinjiang Uyghur Autonomous Region titled "Uyghur, Kazak Search Engine Retrieval Server".

REFERENCES

[1] Bo Ma, Yating Yang, Xi Zhou, Junlin Zhou. An Ontology-based Semantic Retrieval Model for Uyghur Search Engine[C] // IEEE 2nd Symposium on Web Society (SWS2010), Beijing, 2010: 191-195.
[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. Indexing by latent semantic analysis[J]. Journal of the Society for Information Science, 1990, 41(6): 391-407.
[3] E. Vorhees. Query expansion using lexical semantic relations[C] // 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), Dublin, Ireland, 1994: 61-67.
[4] A. Kiryakov, B. Popov, I. Terziev, D. Manov, D. Ognyanoff. Semantic annotation, indexing, and retrieval[J]. Journal of Web Semantics, 2004, 2(1): 49-79.
[5] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, A. Kirilov. KIM - a semantic platform for information extraction and retrieval[J]. Journal of Natural Language Engineering, 2004, 10(3-4): 375-392.
[6] R.V. Guha, R. McCool, E. Miller. Semantic search[C] // the 12th International World Wide Web Conference (WWW2003), Budapest, Hungary, 2003: 700-709.
[7] http://www.hakia.com.
[8] Turdi Tohti, Winira Musajan, Askar Hamdulla. Key Techniques of Uyghur, Kazak, Kyrgyz Full-text Search Engine Retrieval Server[J]. Computer Engineering, 2008, 34(21): 44-46.
[9] Matthews, P. H. Morphology, 2nd Ed. Cambridge Textbooks in Linguistics, 1991.

[10] Creutz, Mathias. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, 2007, 4(1).
[11] Gruber T R. Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 1995, 43(5-6): 907-928.
[12] Wang ZhiXiao, Zhang DaLu. Optimization Algorithm for Edge-Based Semantic Similarity Calculation[J]. PR & AI, 2010, 23(2): 273-277.
[13] M. Fernandez, D. Vallet, P. Castells. Probabilistic score normalization for rank aggregation[C] // 28th European Conference on Information Retrieval (ECIR 2006), London, UK, 2006: 553-556.

Bo Ma was born in Liaoning Province, China, in 1984. He received the B.S. degree from Huazhong University of Science and Technology in 2007. Currently he is a student and will receive the Ph.D. degree in computer science from the Graduate University of the Chinese Academy of Sciences in 2012. He is a student member of the China Computer Federation. His research interests are data mining and semantic search.

Yating Yang was born in Xinjiang Province, China, in 1985. She received the B.S. degree in computer science from Changan University in 2007. Currently she is a student and will receive the Ph.D. degree in computer science from the Graduate University of the Chinese Academy of Sciences in 2012. She is a student member of the China Computer Federation. Her research interest is multilingual information processing.

Xi Zhou was born in Hunan Province, China, in 1978. He received his M.S. degree in computer science from the Graduate University of the Chinese Academy of Sciences in 2003. He is the leader of the Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences. His research interest is multilingual information processing.

Junlin Zhou was born in Shanxi Province, China, in 1945. He received his B.S. degree from Xian Jiaotong University. He is the vice president of the Xinjiang Branch of the Chinese Academy of Sciences and a supervisor of postgraduates. He is a member of the China Computer Federation. His research interest is multilingual information processing.

Formalizing Domain-Specific Metamodeling Language XMML Based on First-order Logic
Tao Jiang
School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming, P.R.China
Email: jtzwy123@gmail.com

Xin Wang
School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming, P.R.China
Email: wxkmyn@yahoo.com.cn

Abstract - Domain-Specific Modeling has been widely and successfully used in software system modeling for specific domains. In spite of its general importance, due to its informal definition a Domain-Specific Metamodeling Language (DSMML) cannot strictly represent its structural semantics, so its properties such as consistency cannot be systematically verified. In response, this paper proposes a formal representation of the structural semantics of a DSMML named XMML based on first-order logic. Firstly, XMML is introduced; secondly, we illustrate our approach by formalizing the attachment relationship, the refinement relationship and the typed constraints of XMML based on first-order logic; based on this, the approach to consistency verification of XMML itself and of metamodels built with XMML is presented; finally, the formalization automatic mapping engine for metamodels is introduced to show the application of the formalization of XMML.

Index Terms - Domain-Specific Metamodeling Language, structural semantics, attachment, refinement, consistency verification

I. INTRODUCTION

Compared with the uniformity and standardization of MDA [1], DSM [2] focuses on simplicity, practicability and flexibility. As the metamodeling language for DSM, a DSMML plays an important role in system modeling of specific areas.

A DSMML is a metalanguage used to build Domain-Specific Modeling Languages (DSMLs); the process of using a DSMML to build domain metamodels indicating the structural semantics of DSMLs is called metamodeling. Correspondingly, a DSML is a modeling language used to build domain application models; the process of using a DSML to build domain application models is called application modeling.

The semantics of a DSMML can be grouped into structural semantics [3] and behavioral semantics. The former concerns the static semantic constraints on the relationships between modeling elements, focusing on the static and structural properties; the latter concerns the execution semantics of domain metamodels, focusing on the dynamic behavior of the metamodels. Although structural semantics is very important, research on structural semantics is not as extensive and deep as on behavioral semantics, so this paper only studies the structural semantics of DSMMLs.

Several problems have not been solved well for DSMMLs, including the precise formal description of their semantics, methods for verifying properties of domain metamodels based on the formalization, and the automatic translation from metamodels to the corresponding formal semantic domain.

This paper proposes a formal representation of the structural semantics of the DSMML named XMML designed by us, based on first-order logic; based on this, the approach to consistency verification of XMML and metamodels is presented, and then the design and implementation of the corresponding formalization automatic mapping engine for metamodels is introduced to show the application of the formalization of XMML.

II. RELATED WORKS

Within the domain-specific language community, graph-theoretic formalisms have received the most research attention [4]. The majority of work focuses on model transformations based on graphs, but analysis and validation of properties of models has not received the same attention. For example, the model transformation tool VIATRA [5] supports executable Horn logic to specify transformations, but does not focus on restricting expressiveness for the purpose of analysis.

Because UML includes many kinds of diagrams, including metamodeling, state machines, activities, sequence charts and so on, approaches for formalizing UML must tackle the temporal nature of its various behavioral semantics, necessitating more expressive formal methods. All these approaches must make trade-offs between expressiveness and the degree of automated analysis. For example, Z [6] or B [7] formalizations of UML could be a vehicle for studying rich syntax, but automated analysis and verification is less likely to be available.

Supported by the Yunnan Provincial Department of Education Research Fund Key Project (No. 2011z025) and General Project (No. 2011y214). Corresponding author e-mail: jtzwy123@gmail.com.


There is much typical work on the formalization of modeling languages, such as Andres' formalization and verification of UML class diagrams based on ADT [8], Kaneiwa's formalization of UML class diagrams based on first-order logic [9], Paige's formalization of BON based on PVS [10], and Jackson's formalization of DSMLs based on Horn logic [11]. Without considering formalization of the metamodeling language and automatic translation from metamodels to the corresponding formal semantic domain, these approaches offer a lower level of automated analysis and verification.

III. AN INTRODUCTION TO XMML

We begin by introducing the layered architecture of XMML, and then an overview of the abstract syntax of XMML is given.

A. Layered Architecture of XMML

Similar to the structure of UML, XMML is divided into the following four layers: the metamodeling language layer, used to define different DSMLs, where XMML is located; the DSML layer, used to build concrete domain application models; the domain application model layer, from which the source code of the target system is produced by a code generator; and the target application system layer [12]. The layered architecture of XMML is shown in Figure 1.

[Figure 1. Layered architecture of XMML]

In order to distinguish between model elements of different levels of abstraction, we require that an element of XMML is called a metamodeling element, an element of a DSML built by metamodeling is called a domain modeling element, and a domain object built by domain application modeling is called a domain model element. Among them, a metamodeling element is also called a meta-type; the type of a model element is the name of its modeling element, and the type of a modeling element is the name of its meta-type.

B. Abstract Syntax of XMML

We extend and refine the abstract syntax of XMML to meet the needs of formalization and consistency verification. The metamodeling elements of the improved XMML are divided into two kinds: entity types and association types; the former are used to describe modeling entities in a domain metamodel and the latter concern relationships between modeling entities.

Metamodeling elements of entity type contain four types: model type, entity type, reference entity type and relationship type. Metamodeling elements of association type include the following six types: the role assignment association, used to establish the connection between an entity type and a relationship type; the model containment relationship, used to express that a model contains all entity-type modeling elements; the attachment relationship, used to describe a close containment relationship between entity-type modeling elements; the entity containment relationship, used to describe a loose containment relationship between entity-type modeling elements; the reference relationship, used to build a reference between a reference entity and the referenced entity; and the refinement relationship, used to establish the correspondence between an entity and its refined model for multi-layer modeling and model refinement. The structural semantics of XMML will be formalized based on the above ten types of metamodeling elements.

IV. FORMALIZATION OF XMML BASED ON FIRST-ORDER LOGIC

We give a formal definition of XMML; based on this, the attachment relationship, the refinement relationship and the typed constraints of XMML are formalized in first-order logic to show our approach to formalizing the structural semantics of XMML.

A. A Formal Definition of XMML

XMML can be regarded as the composition of the following five parts: a set of predicate symbols S_XMML denoting the corresponding metamodeling elements; an extended set of predicate symbols S^C_XMML used to derive properties; a set of closed first-order logic formulas F_XMML denoting constraints over all metamodels built with XMML; a set of constants O_XMML denoting public properties; and a set of term symbols \Sigma_XMML denoting the modeling elements constituting a metamodel. Among them, S^C_XMML and O_XMML may be empty, and F_XMML is defined using first-order logic implication formulas over S_XMML, S^C_XMML and O_XMML. The definition concerns the formal characterization of the structural properties of XMML, focusing on the description of the constraint relationships between modeling elements. XMML is therefore defined as follows.

Definition 1 (XMML). The DSMML named XMML, L_XMML, is a 5-tuple of the form <S_XMML, S^C_XMML, \Sigma_XMML, O_XMML, F_XMML>, consisting of S_XMML, S^C_XMML, O_XMML, F_XMML and \Sigma_XMML. S_XMML and S^C_XMML as groups of predicate symbols, O_XMML as a group of constant symbols, and F_XMML as a group of constraint axioms are all added to the first-order logic formal system called predicate calculus Q [13][14] to form the formal system of XMML, called T_XMML, based on predicate calculus Q. The powerset of the term algebra, M_XMML = P(T_{S_XMML}(\Sigma_XMML)), over S_XMML generated by \Sigma_XMML \cup O_XMML, is considered as a
generated by XMML= X M M L OXMML is considered as a

group of interpretations of T_XMML, used to determine whether any metamodel m ∈ M_XMML is well-formed for XMML. Once S_XMML, S^C_XMML, O_XMML and F_XMML are derived, we finish the formalization of L_XMML based on first-order logic.

B. Formalization of Meta-types of Entity Type

For each Model, a unary predicate Model(x) is defined to denote that the meta-type of modeling element x is Model, i.e. Model(x) ∈ S_XMML. A Model can contain the other two modeling elements of entity type. For each Entity, a unary predicate Entity(x) is defined to denote that the meta-type of modeling element x is Entity, i.e. Entity(x) ∈ S_XMML. An Entity can be contained in a model by the model containment relationship, point to a refined model by the refinement relationship, establish an association with another entity by role assignment association, or form containment with another entity by the attachment relationship or the entity containment relationship. For each Reference Entity, a unary predicate RefEntity(x) is defined to denote that the meta-type of modeling element x is Reference Entity, i.e. RefEntity(x) ∈ S_XMML. A Reference Entity can point to the referenced entity by the reference relationship. Similarly, for each Relationship, a unary predicate Relationship(x) is defined to denote that the meta-type of modeling element x is Relationship, i.e. Relationship(x) ∈ S_XMML. A Relationship can be used to establish an explicit association between modeling elements of entity type, combined with role assignment association.

C. Formalization of Attachment Relationship

For each attachment relationship (denoted Attachment) from modeling element of entity type x to y, a binary predicate Attachment(x, y) is defined to represent that element x is attached to element y, i.e. Attachment(x, y) ∈ S_XMML. In the metamodel shown in Figure 2, the modeling element of entity type Interface is attached to Component, so Attachment(Interface, Component) is a legal binary predicate symbol of the attachment meta-type. As can be seen from Figure 3, the following constraint relationships exist.

1) Type Constraint: An attachment edge must start from and end at a modeling element of entity type. This can be expressed as an implication formula named Attach1 in the form of ∀x, y. Attachment(x, y) → Entity(x) ∧ Entity(y).

2) Self-attached Constraint: Because attachment expresses close containment, the same modeling element of entity type cannot be attached to itself. For example, the self-attachment of Interface in Figure 4 is not allowed. We can express this as a predicate formula named Attach2 in the form of ∀x. ¬Attachment(x, x).

3) Attachment Loop: An attachment loop formed between two modeling elements of entity type is not allowed because it expresses a contradictory and meaningless modeling intent. For example, the attachment loop between Interface and Component in Figure 5 is illegal. This can be expressed as an implication formula named Attach3 in the form of ∀x, y. Attachment(x, y) → ¬Attachment(y, x).

4) Attachment Path: To maintain well-formedness and reduce complexity, we require that only a one-layer attachment path between two entities is legal; attachment paths of two or more layers formed between entities are prohibited. For example, the two-layer attachment path formed by Interface attached to Component and Component attached to Subsystem in Figure 6 is not allowed. Assume that a two-layer attachment path formed by x attached to y and y attached to z is denoted as AttaPath(x, y, z), i.e. AttaPath(x, y, z) ∈ S^C_XMML. AttaPath(x, y, z) can be defined from Attachment by an implication formula of the form ∀x, y, z. Attachment(x, y) ∧ Attachment(y, z) ∧ (x ≠ y) ∧ (y ≠ z) ∧ (x ≠ z) → AttaPath(x, y, z), so we can express this constraint as a predicate formula named Attach4 in the form of ∀x, y, z. ¬AttaPath(x, y, z).

Figure 2. An example of metamodel    Figure 3. Attachment
Figure 4. Self-attached    Figure 5. Attachment loop
Figure 6. Attachment path    Figure 7. An example of attachment

According to Attach1, both ends connected by an attachment edge are modeling elements of entity type; thus, from the perspective of the semantics of first-order logic, we can prove the semantic non-implication from Attach1 to Attach2, Attach3 and Attach4 by finding a

counter-example interpretation that makes Attach1 true and makes Attach2, Attach3 and Attach4 false.

Theorem 1 (Semantic non-implication of attachment constraints). Formula Attach1 cannot semantically entail formulas Attach2, Attach3 and Attach4, i.e. Attach1 ⊭ Attach2, Attach1 ⊭ Attach3 and Attach1 ⊭ Attach4.

Proof. As a semantic interpretation of the formula set composed of Attach1 to Attach4, the metamodel shown in Figure 4 can be expressed as a set of predicate statements composed of Attachment(Interface, Interface) and Entity(Interface), which makes Attach1 true and makes Attach2 false due to the self-attachment of Interface, so we can derive Attach1 ⊭ Attach2. Similarly, the metamodel shown in Figure 5 can be expressed as a set of predicate statements composed of Attachment(Interface, Component) and Attachment(Component, Interface), which makes Attach1 true and makes Attach3 false due to the attachment loop formed between Interface and Component; thus Attach1 ⊭ Attach3 can be derived. In addition, Attachment(Interface, Component) and Attachment(Component, Subsystem), corresponding to the metamodel in Figure 6, both satisfy Attach1 but make Attach4 false due to the two-layer attachment path formed among Interface, Component and Subsystem; therefore, we can derive Attach1 ⊭ Attach4.

Are there grammatical inference relationships among Attach2, Attach3 and Attach4? We find that Attach2 can be derived from Attach3 based on the natural deduction rules for quantifiers (NDRQ), which include the premise introduction rule (denoted P), the separation rule (denoted S), the return-false rule (denoted N), the quantifier rule (denoted Q), and so on [14]. Therefore, we can derive the following theorem.

Theorem 2 (Grammatical inference relationship of attachment constraints). Formula Attach2 can be derived from formula Attach3, i.e. ∀x, y. Attachment(x, y) → ¬Attachment(y, x) ⊢ ∀x. ¬Attachment(x, x).

Proof. (The derivation is omitted.)

Because Attach3 ⊢ Attach2, after Attach2 is removed there remain only Attach1, Attach3 and Attach4, among which there are six pairs of semantic non-implication relations. Similarly to Theorem 1, we can also derive Attach3 ⊭ Attach4, Attach3 ⊭ Attach1, Attach4 ⊭ Attach1 and Attach4 ⊭ Attach3, so it is obvious that Attach1, Attach3 and Attach4 are semantically independent. Therefore, the formula set of attachment constraints contains only Attach1, Attach3 and Attach4.

Theorem 3 (Semantic consistency of the formula set). The formula set comprised of Attach1, Attach3 and Attach4 is semantically consistent.

Proof. As a semantic interpretation of the formula set composed of Attach1, Attach3 and Attach4, the metamodel shown in Figure 7 can be expressed as a set of predicate statements composed of Attachment(Interface, Component) and Attachment(Interface, Connection). Because there exist neither attachment loops nor attachment paths of two or more layers in the metamodel, both statements satisfy Attach1, Attach3 and Attach4. Therefore, the metamodel shown in Figure 7 can be considered as a semantic interpretation that satisfies the formula set, i.e. the formula set is satisfiable. By the related definitions of first-order logic, the theorem is proved.

According to related theorems of first-order logic [14], the formula set is grammatically consistent, and thus it is consistent. So the formula subset of attachment constraints, named AttachmentSet, is comprised of Attach1, Attach3 and Attach4, i.e. AttachmentSet = {Attach1, Attach3, Attach4}.
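To make the role of these constraint formulas concrete, the following short C# sketch (our own illustration, not part of MapM or any tooling described in this paper; the fact representation is an assumption) checks Attach1, Attach3 and Attach4 directly over a metamodel given as ground Entity and Attachment facts, in the spirit of the interpretations used in the proofs above.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: a metamodel interpretation as ground facts,
// with the attachment constraints Attach1, Attach3 and Attach4 checked directly.
static class AttachmentConstraints
{
    // Attach1: every attachment edge starts and ends at an entity-type element.
    static bool Attach1(ISet<string> entities, ISet<(string, string)> attachments) =>
        attachments.All(a => entities.Contains(a.Item1) && entities.Contains(a.Item2));

    // Attach3: no attachment loop between two elements (this also rules out self-attachment, i.e. Attach2).
    static bool Attach3(ISet<(string, string)> attachments) =>
        attachments.All(a => !attachments.Contains((a.Item2, a.Item1)));

    // Attach4: no two-layer attachment path x -> y -> z with x, y, z pairwise distinct.
    static bool Attach4(ISet<(string, string)> attachments) =>
        !attachments.Any(a => attachments.Any(b =>
            a.Item2 == b.Item1 && a.Item1 != a.Item2 && b.Item1 != b.Item2 && a.Item1 != b.Item2));

    static void Main()
    {
        // The metamodel of Figure 2: Interface attached to Component.
        var entities = new HashSet<string> { "Interface", "Component" };
        var attachments = new HashSet<(string, string)> { ("Interface", "Component") };

        Console.WriteLine(Attach1(entities, attachments)); // True
        Console.WriteLine(Attach3(attachments));           // True
        Console.WriteLine(Attach4(attachments));           // True
    }
}

Replacing the fact set with Attachment(Interface, Interface), or with the loop of Figure 5, makes the corresponding check return false, mirroring the counter-example interpretations used in Theorem 1.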

D. Formalization of Refinement Relationship

For each refinement relationship (denoted Refinement) from modeling element of entity type x to model type y, a binary predicate Refinement(x, y) is defined to represent that element x points to element y by a refinement edge, i.e. Refinement(x, y) ∈ S_XMML. In the metamodel shown in Figure 8, the edge Refinement(Component, SoftwareArchitecture), built by the modeling element of entity type Component pointing to its refined model SoftwareArchitecture, is a legal binary predicate symbol of the refinement meta-type. As can be seen from Figure 9, the following constraint rules exist.

1) Type Constraint: A refinement edge must start from a modeling element of entity type and end at a modeling element of model type. This can be expressed as an implication formula named Refine1 in the form of ∀x, y. Refinement(x, y) → Entity(x) ∧ Model(y).

2) Uniqueness Constraint: The same modeling element of entity type cannot point to two or more refined models, otherwise ambiguity is produced. For example, the metamodel in Figure 13 is illegal because the modeling element Component points to two different refined models, SoftwareArchitectureA and SoftwareArchitectureB. We can express this as an implication formula named Refine2 in the form of ∀x, y, z. Refinement(x, y) ∧ Refinement(x, z) → (y = z).

3) Identity Constraint: The refined model that a modeling element of entity type points to and the model in which it is contained must be identical, so that a multi-layer model structure can be built using this recursive relationship. For example, in Figure 14, the refined model SoftwareArchitectureB of Component and the model SoftwareArchitectureA containing it are different, so a multi-layer model structure cannot be built on it. This can be expressed as an implication formula named Refine3 in the form of ∀x, y, z. Refinement(x, y) ∧ Containment(x, z) → (y = z). In formula Refine3, Containment(x, y) is a binary predicate denoting the model containment relationship, in which modeling element of entity type x is contained in model type y.

4) Self-refinement Constraint: The same modeling element of entity type cannot point to itself by a refinement edge. For example, the self-refinement of Component in Figure 10 is not allowed. We can express this as a predicate formula named Refine4 in the form of ∀x. ¬Refinement(x, x).

5) Refinement Loop Constraint: A refinement loop formed between two modeling elements is not allowed because it expresses a contradictory and meaningless modeling intent. For example, the refinement loop formed by Component and SoftwareArchitecture pointing to each other in Figure 11 is illegal. This can be expressed as an implication formula named Refine5 in the form of ∀x, y. Refinement(x, y) ∧ (x ≠ y) → ¬Refinement(y, x).

6) Refinement Path Constraint: To maintain well-formedness and reduce complexity, we require that only a one-layer refinement path between two entities is legal; refinement paths of two or more layers are prohibited. For example, the two-layer refinement path formed by ComponentA pointing to SoftwareArchitecture and SoftwareArchitecture pointing to ComponentB in Figure 12 is not allowed. Assume that a two-layer refinement path formed by x pointing to y and y pointing to z is denoted as RefinePath(x, y, z), i.e. RefinePath(x, y, z) ∈ S^C_XMML. RefinePath(x, y, z) can be defined from Refinement by an implication formula of the form ∀x, y, z. Refinement(x, y) ∧ Refinement(y, z) ∧ (x ≠ y) ∧ (y ≠ z) ∧ (x ≠ z) → RefinePath(x, y, z), so we can express this constraint as a predicate formula named Refine6 in the form of ∀x, y, z. ¬RefinePath(x, y, z).

Figure 8. An example of metamodel
Figure 9. Refinement    Figure 10. Self-refinement
Figure 11. Refinement loop    Figure 12. Refinement path of two layers
Figure 13. Refinement ambiguity    Figure 14. Refined and containing model
Figure 15. An example of violating Refine1
Figure 16. Refine3 and Cont5 deriving Refine2

Any modeling element belongs to one and only one meta-type; on the other hand, according to Refine1, the two ends connected by a refinement edge belong to different meta-types. Therefore, from the perspective of the semantics of first-order logic, we can prove the semantic implication from Refine1 to Refine4, Refine5 and Refine6.

Theorem 4 (Semantic implication of refinement constraints). Formula Refine1 can semantically entail formulas Refine4, Refine5 and Refine6, i.e. Refine1 ⊨ Refine4, Refine1 ⊨ Refine5 and Refine1 ⊨ Refine6.

Proof. Any semantic interpretation that makes Refine1 true forces every refinement to satisfy the condition that its two ends belong to different meta-types, one end being a modeling element of entity type and the other a modeling element of model type. Obviously, this condition excludes the possibility of self-refinement of the same modeling element and also makes it impossible to form a refinement loop or a refinement path of two or more layers; thus this interpretation certainly makes Refine4, Refine5 and Refine6 true. By the related definition of semantic implication in first-order logic, the theorem is proved.

Now the formula set of refinement constraints contains only Refine1, Refine2 and Refine3. Are there semantic implication relationships among them? We find that Refine2 can be derived from the identity constraint of refinement, named Refine3, and the uniqueness of the model in which the same modeling element of entity type is contained, named Cont5. In Figure 16, the modeling element x points to two different refined models R1 and R2 by two different refinement edges, and the model M that can contain x is unique by Cont5; thus, by Refine3, R1 and M are the same modeling element of model type, i.e. R1 = M; similarly, R2 and M are the same modeling element of model type, i.e. R2 = M, so R1 = R2.

Only Refine1 and Refine3 are left in the set, and between them there are two possible semantic implication relationships. Although a syntactic derivation between them cannot be directly proved, we can show the semantic non-implication from Refine1 to Refine3 by finding a counter-example interpretation that makes Refine1 true and makes Refine3 false from the perspective of the

semantics of first-order logic. Similarly, the semantic non-implication from Refine3 to Refine1 can also be shown.

Theorem 5 (Semantic non-implication of refinement constraints). Formula Refine1 cannot semantically entail formula Refine3, and vice versa, i.e. Refine1 ⊭ Refine3 and Refine3 ⊭ Refine1.

Proof. As a semantic interpretation of the formula set composed of Refine1 and Refine3, the metamodel shown in Figure 14 can be expressed as a set of predicate statements composed of Refinement(Component, SoftwareArchitectureB) and Containment(Component, SoftwareArchitectureA), which makes Refine1 true and makes Refine3 false due to violation of the identity constraint, so we can derive Refine1 ⊭ Refine3. Similarly, the metamodel shown in Figure 15 can be expressed as a set of predicate statements composed of Refinement(ComponentA, ComponentB) and Containment(ComponentA, ComponentB), which makes Refine3 true and makes Refine1 false due to violation of the type constraint, so Refine3 ⊭ Refine1 can be derived.

Theorem 6 (Semantic consistency of the formula set). The formula set comprised of Refine1 and Refine3 is semantically consistent.

Proof. As a semantic interpretation of the formula set composed of Refine1 and Refine3, the metamodel shown in Figure 8 can be expressed as a set of predicate statements composed of Refinement(Component, SoftwareArchitecture) and Containment(Component, SoftwareArchitecture). Because Component belongs to entity type, SoftwareArchitecture is an element of model type, and the refined model that Component points to and the model in which it is contained are the same SoftwareArchitecture, both statements satisfy Refine1 and Refine3. Therefore, the metamodel shown in Figure 8 can be considered as a semantic interpretation that satisfies the formula set, i.e. the formula set is satisfiable. By the related definitions of first-order logic, the theorem is proved.

According to related theorems of first-order logic [14], the formula set is grammatically consistent, and thus it is consistent. So the formula subset of refinement constraints, named RefinementSet, is comprised of Refine1 and Refine3, i.e. RefinementSet = {Refine1, Refine3}.

E. Formalization of Typed Constraints

XMML is a typed metamodeling language, so the metamodels built based on XMML must be well-typed. On the basis of the relevant literature [15], we characterize the typed constraints of XMML in terms of the completeness and uniqueness of the classification of modeling elements of entity type.

1) Completeness of Classification

Four meta-types of entity type are defined in XMML to build metamodels, so their classification is established accordingly. Such a classification must be complete in the sense that every modeling element of entity type must be an instance of one meta-type of entity type; otherwise the metamodel becomes meaningless because it contains elements not belonging to any meta-type of entity type. We can express this as a predicate formula named Type1 in the form of ∀x. Model(x) ∨ Entity(x) ∨ Relationship(x) ∨ RefEntity(x). Type1 denotes that any modeling element of entity type must belong to one of the above four meta-types.

2) Uniqueness of Classification

The classification of modeling elements of entity type must also be unique, because allowing a modeling element to belong to more than one meta-type leads to an ambiguous interpretation of the element; so we require that every modeling element belongs to one and only one meta-type. This can be expressed as a group of implication formulas named Type2 of the form:

∀x. Model(x) → ¬Entity(x)
∀x. Model(x) → ¬Relationship(x)
∀x. Model(x) → ¬RefEntity(x)
∀x. Entity(x) → ¬Relationship(x)
∀x. Entity(x) → ¬RefEntity(x)
∀x. Relationship(x) → ¬RefEntity(x)

The number of formulas num_type2 in Type2 is the number of combinations obtained by taking any two of the four meta-types, i.e. num_type2 = 4·(4-1)·0.5 = 6. So the formula subset of typed constraints, named TypedSet, is comprised of Type1 and Type2, i.e. TypedSet = {Type1, Type2}. TypedSet makes it explicit that a metamodel, as an instance of XMML, must have its modeling elements of entity type completely and uniquely classified by the four meta-types. This reflects the strict metamodeling principle proposed in the literature [15].

F. Formalization of the Other Meta-types of Association Type

By formalizing the other meta-types of association type in the same way, we can establish, one by one, the formula subset of role assignment association constraints named RoleAssginRelaSet, the formula subset of model containment constraints named ContainmentSet, the formula subset of entity containment constraints named EntiContSet, and the formula subset of reference constraints named ReferenceSet. Based on this, the formula subset of exclusion constraints named ExclusionSet is created to represent the exclusive constraints among all meta-types. Therefore, the set of constraint axioms of T_XMML, named F_XMML, can be considered as the union of all of the above subsets, i.e.

F_XMML = ContainmentSet ∪ AttachmentSet ∪ EntiContSet ∪ RoleAssginRelaSet ∪ RefinementSet ∪ ReferenceSet ∪ ExclusionSet ∪ TypedSet.
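As an illustration of how the constraint axiom set F_XMML can be assembled mechanically from the subsets named above, the following C# sketch (illustrative only; the formulas are kept as plain strings and the subset contents are placeholders) generates the 4·(4-1)/2 = 6 exclusion implications of Type2 from the four entity meta-types and unions them with the other subsets.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: building the axiom set F_XMML as a union of formula subsets,
// with the six pairwise Type2 exclusion formulas generated from the four meta-types.
static class AxiomSetBuilder
{
    static IEnumerable<string> Type2()
    {
        var metaTypes = new[] { "Model", "Entity", "Relationship", "RefEntity" };
        for (int i = 0; i < metaTypes.Length; i++)
            for (int j = i + 1; j < metaTypes.Length; j++)
                yield return $"forall x. {metaTypes[i]}(x) -> not {metaTypes[j]}(x)";
    }

    static void Main()
    {
        var typedSet = new List<string>
        {
            "forall x. Model(x) or Entity(x) or Relationship(x) or RefEntity(x)" // Type1
        };
        typedSet.AddRange(Type2()); // the 4*(4-1)/2 = 6 exclusion implications

        var attachmentSet = new[] { "Attach1", "Attach3", "Attach4" }; // placeholders for the formulas above
        var refinementSet = new[] { "Refine1", "Refine3" };

        // F_XMML as the union of the constraint subsets (the remaining subsets are elided here).
        var fXmml = typedSet.Concat(attachmentSet).Concat(refinementSet).ToList();
        Console.WriteLine($"Type2 contains {typedSet.Count - 1} formulas; this F_XMML sketch contains {fXmml.Count} entries.");
    }
}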

V. CONSISTENCY AND VERIFICATION OF XMML AND ITS METAMODELS

The formal system of XMML, called T_XMML and based on predicate calculus Q, is established by the formalization of all meta-types of XMML. A semantic interpretation of T_XMML is a metamodel built based on XMML; the universe of discourse of the interpretation is the set of all entity modeling elements and constants contained in the metamodel. Similarly, a metamodel built based on XMML can be formalized via a metamodel mapping from the metamodel to a set of predicate statements.

Once XMML and a metamodel are formalized based on first-order logic, we can implement logical consistency verification of XMML and its metamodels based on first-order logical inference.

A. Consistency and Verification of XMML

It is not easy to find a true interpretation for the constraint axiom set F_XMML of T_XMML in order to prove the semantic consistency of T_XMML; on the other hand, it is very difficult to derive the grammatical consistency of F_XMML by hand due to the many formulas contained in F_XMML. So we can only prove the logical consistency of T_XMML with an automatic theorem prover. With reference to the literature [15], we give the following definition.

Definition 2 (Logical consistency of XMML). XMML is logically consistent iff the constraint axiom set F_XMML of T_XMML is proved to be logically consistent by the automatic theorem prover; XMML is logically inconsistent iff the constraint axiom set F_XMML of T_XMML is proved to be contradictory by the automatic theorem prover, denoted F_XMML ⊢ False.

B. Consistency and Verification of Metamodels

If T_XMML is proved to be logically consistent, then XMML must have a satisfiable interpretation, and thus it is meaningful to discuss properties of metamodels built based on XMML. From the point of view of formalization, a legal metamodel is an interpretation that satisfies all constraint formulas of F_XMML, so the relationship that a metamodel satisfies XMML is equivalent to the relationship that an interpretation of T_XMML satisfies T_XMML. By the equivalence of the satisfaction relationship and logical consistency, we obtain a method for determining the consistency of a metamodel built based on XMML.

Inference 1 (Logical consistency of metamodel). If the union of the constraint axiom set F_XMML of T_XMML and the set of first-order predicate statements TL(M) generated from metamodel M is logically consistent, then the metamodel M is consistent; conversely, if the union of the constraint axiom set F_XMML of T_XMML and the set of first-order predicate statements TL(M) generated from metamodel M is logically inconsistent, denoted F_XMML ∪ TL(M) ⊢ False, then the metamodel M is inconsistent.

VI. DESIGN AND IMPLEMENTATION OF MAPM

The formalization automatic mapping engine for metamodels, called MapM (Mapping of Metamodels), is designed and implemented to perform the automatic translation from a metamodel based on the concrete syntax of XMML to the corresponding set of first-order predicate statements TL(M) in SPASS format [16]; thus the analysis and verification of the consistency of metamodels built based on XMML can be carried out automatically. The logical architecture of MapM is shown in Figure 17.

Based on the .NET 2.0 platform and using C#.net as the development language, we implement the corresponding prototype system for MapM and integrate it into Archware [12], the modeling environment of XMML; thus it becomes possible for Archware to verify metamodels built based on XMML. The running interface of MapM is shown in Figure 18: its left window shows the XML format document of a metamodel produced by Archware, and the right window shows the corresponding first-order logic system in SPASS format generated by the translation of MapM.

Figure 17. Logical architecture of MapM (the Archware metamodeling tool produces an XML format metamodel with additional semantics; MapM's constant, statement and formula generators produce the logic system for the metamodel, which, together with the formalized system of XMML T_XMML, is passed to the automatic theorem prover SPASS)

Figure 18. Running interface of MapM
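To give a feel for the kind of translation MapM performs, the sketch below is our own illustration; the XML element and attribute names are assumptions and do not reproduce the Archware format or the actual MapM code. It walks a small metamodel document and emits ground predicate statements such as Attachment(Interface, Component); the resulting TL(M), together with F_XMML, is what would be handed to the theorem prover according to Inference 1.

using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Illustrative sketch only: turning a (hypothetical) metamodel XML document into
// ground first-order predicate statements, one per modeling element and edge.
static class StatementGenerator
{
    static IEnumerable<string> Translate(XElement metamodel)
    {
        // Hypothetical schema: <entity name="..."/> and <edge kind="Attachment|Refinement|Containment" from="..." to="..."/>
        foreach (var e in metamodel.Elements("entity"))
            yield return $"Entity({(string)e.Attribute("name")})";
        foreach (var e in metamodel.Elements("edge"))
            yield return $"{(string)e.Attribute("kind")}({(string)e.Attribute("from")}, {(string)e.Attribute("to")})";
    }

    static void Main()
    {
        var doc = XElement.Parse(
            "<metamodel>" +
            "  <entity name='Interface'/><entity name='Component'/>" +
            "  <edge kind='Attachment' from='Interface' to='Component'/>" +
            "</metamodel>");

        foreach (var statement in Translate(doc))
            Console.WriteLine(statement); // TL(M); checked together with F_XMML for consistency
    }
}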

VII. CONCLUSIONS

This work derives from the Yunnan Province Department of Education Research Fund Key Project (No. 2011z025). A DSMML defined in an informal way cannot precisely describe its structural semantics, which makes it difficult to systematically verify its properties such as consistency. In response, this paper proposes a formal representation, based on first-order logic, of the structural semantics of the DSMML named XMML designed by us. We then illustrate our approach by the formalization of the attachment relationship, the refinement relationship and the typed constraints of XMML in first-order logic. Based on this, the approach for consistency verification of XMML itself and of its metamodels is presented. Finally, we design and implement the corresponding formalization automatic mapping engine for metamodels to show the application of the formalization of XMML.

ACKNOWLEDGMENT

The authors would like to thank Prof. Hua Zhou, Dr. Xinping Sun and Dr. Yong Yu for valuable discussions. This work was supported by the Yunnan Provincial Department of Education Research Fund Key Project (No. 2011z025) and General Project (No. 2011y214).

REFERENCES

[1] Miller J, Mukerji J, MDA guide version 1.0.1, http://www.omg.org/docs/omg/03-06-01.pdf, 2003.
[2] dsmforum, Enterprise apps in smartphones, http://www.dsmforum.org/phone.html.
[3] Jackson E. K., Sztipanovits J., Formalizing the structural semantics of domain-specific modeling languages, Journal of Software and Systems Modeling, 2008.
[4] Bezivin J. and Gerbe O., Towards a precise definition of the OMG/MDA framework, in Proceedings of the 16th Conference on Automated Software Engineering (ASE 01), 2001, pp. 273-280.
[5] Csertan G., Huszerl G., Majzik I., Pap Z., Pataricza A. and Varro D., VIATRA - visual automated transformations for formal verification and validation of UML models, in ASE (2002), pp. 267-270.
[6] Evans A., France R. B. and Grant E. S., Towards formal reasoning with UML models, in Proceedings of the Eighth OOPSLA Workshop on Behavioral Semantics.
[7] Marcano R. and Levy N., Using B formal specifications for analysis and verification of UML/OCL models, in Workshop on Consistency Problems in UML-based Software Development, 5th International Conference on the Unified Modeling Language (2002), pp. 91-105.
[8] W. Andreopoulos, Defining formal semantics for the Unified Modeling Language, Technical Report, University of Toronto, Toronto, 2000.
[9] K. Kaneiwa and K. Satoh, Consistency checking algorithms for restricted UML class diagrams, in 4th International Symposium on Foundations of Information and Knowledge Systems (FoIKS 2006), LNCS 3861, 2006, pp. 219-239.
[10] R. F. Paige, P. J. Brooke, Metamodel-based model conformance and multiview consistency checking, ACM Transactions on Software Engineering and Methodology, 2007, 16(3), pp. 1-49.
[11] Jackson E. K., Sztipanovits J., Towards a formal foundation for domain specific modeling languages, Proceedings of the Sixth ACM International Conference on Embedded Software (EMSOFT'06), October 2006, pp. 53-62.
[12] Sun XP, A Research of Visual Domain-Specific Meta-Modeling Language and Its Instantiation, Kunming: Yunnan University, 2008.
[13] Gu TL, Formal Methods of Software Development, Higher Education Press, Beijing, 2005.
[14] Cheng MZ, Yu JW, Logic Foundation: First-order Logic and First-order Theory, Chinese People University Press, Beijing, 2003.
[15] H. Zhu, L. Shan, I. Bayley and R. Amphlett, A formal descriptive semantics of UML and its applications, in UML 2 Semantics and Applications, K. Lano (Ed.), John Wiley & Sons, Inc., 2008.
[16] Christoph Weidenbach, SPASS: Tutorial, 2000.

Tao Jiang was born in Kunming, China, in 1973. He received his B.Sc. degree in Computer Software from Nanjing University, China, in 1995, his M.Sc. degree in Computer Software and Theory from Yunnan University, China, in 2003, and his Ph.D. degree in Information Systems Analysis and Integration from Yunnan University, China, in 2010. The major fields of his studies involve multiple branches of Software Engineering.
During 1996-2005, he worked as a Software Engineer in the Department of Information Technology, China Construction Bank. Currently, he is an Associate Professor in the School of Mathematics and Computer Science, Yunnan University of Nationalities.
His research areas cover domain-specific visual modeling, modeling formalization, model verification, formal methods of software development, and Web applications. He has more than 20 published scientific papers in international conferences and journals.

Xin Wang was born in Kunming, China, in 1963. He received his M.Sc. degree in Software Engineering from Yunnan University, China, in 2006. The major fields of his studies involve Software Engineering and Data Mining. Currently, he is a Professor in the School of Mathematics and Computer Science, Yunnan University of Nationalities. His research areas cover model checking, formal methods of software development, database applications and data mining. He has more than 10 published scientific papers in international conferences and journals.

Framework and Implementation of the Virtual Item Bank System

Wen-Wei Liao 1,2, Rong-Guey Ho 2
1 Information Management Department, Chinese Culture University, Taipei, Taiwan
2 Graduate Institute of Information and Computer Education, National Taiwan Normal University, Taipei, Taiwan
Email: abard@ice.ntnu.edu.tw, hrg@ntnu.edu.tw
doi:10.4304/jsw.7.6.1329-1337

Abstract - In ancient China, the Bagua of the I Ching was applied to tell people's fortunes. In the modern era, we apply tests to infer people's intelligence, future development direction and potential. However, it is not easy to design tests, and security has also become a difficulty for test designers. This study employs Item Response Theory (IRT) and content-based image retrieval (CBIR) to establish an item bank. There are no actual items in the item bank; it is replaced by a Virtual Item Bank system (VIBS), which contains only basic objects and processes. The items created by the system are generated directly from these objects and processes. The system completely resolves the security issue of the item bank, and the variety of exercise systems built with it is also of considerable help in enhancing students' abilities.

Index Terms - CBIR, IRT, VIBs

I. INTRODUCTION

As computer technology has become widely applied in teaching, using computers to administer tests has become an important trend. ETS (Educational Testing Service) has promoted Computer-Based Testing (CBT) since 1990. For example, the GRE (Graduate Record Examination) has been administered with CBT since 1992, and since 1993 IRT has been combined with it so that tests are implemented as Computerized Adaptive Testing (CAT). The computer version of TOEFL (Test of English as a Foreign Language) started in 1998, and Taiwan also started to apply CAT in 2000 (TOEFL-CBT). ETS changed TOEFL-CBT to TOEFL-iBT in 2006, and the old computer-based TOEFL was then put out of use [1].

The greatest difference between CBT and CAT is that CAT adapts immediately to the test taker's previous answers: the entire test is tailored to the test taker's ability and skill, that is, different questions are offered according to the different abilities of the test takers. In short, if the test taker answers the first question correctly, the second question will be harder; on the other hand, if the test taker answers the question incorrectly, then the second question will be easier. During the process, the difficulty level is adjusted according to the answering status of the test taker so as to select the questions most suitable to the test taker's current ability, and the process is repeated until a predetermined standard is achieved (or the measurement error is within the tolerance level).

As a result of the reduction in both testing time and the number of test items, many studies have since focused on the application of CAT [2]. Nevertheless, the problems associated with the development of item banks remain unresolved, primarily due to manpower, budget and time constraints.

Figural tests are comprehensive mental ability testing tools for children and the illiterate. However, it is acknowledged that building a figural test can be rather challenging [3]. There are at least eight figural test development steps, including designing test specifications, editing items, collecting pre-test data, analyzing item parameters, revising items, selecting an appropriate scoring method, formal testing, and assessing the overall success of the test.

Item exposure rate is one of the most important factors influencing the security of a figural test. The most common way of reducing this risk is to impose a maximum exposure rate, and several other methods have also been proposed in line with this aim [4][5]. All of these methods establish a single value of r throughout the test. In this study, we present a new method, known as the Virtual Item Bank (VIB) method, which creates an item bank with unlimited items. We describe the implementation of the VIB and evaluate its performance with an empirical experiment. With this method, the item exposure rate is always 0; hence, the problems associated with item exposure can be resolved.

II. LITERATURE REVIEW

This study develops the virtual item bank system by referring to the relevant studies of IRT, CAT, data mining, and automatic item-generation systems in computer-based figural testing. The related literature is reviewed below.

A. Item Response Theory

IRT is a series of mathematical models mostly used to analyze the scoring of tests or questionnaire data. The objective of these models is to determine whether the latent trait is expressible through the test. These models are currently used extensively in psychological and educational measurement. IRT was developed in the 1960s by the Danish statistician Georg Rasch [6] and the American psychological statistician Frederic M. Lord [7], simultaneously in their respective countries. Despite the different approaches applied, their results were quite similar. The IRT model is given below:
P(θ) = c + (1 − c) ∫_{−∞}^{a(θ−b)} (1/√(2π)) e^{−t²/2} dt    ...(1)

This model was named the three-parameter normal-ogive model (3PN) by Lord. To simplify the numerical treatment in practice, the three-parameter logistic model (3PL) is used more often:

P(θ) = c + (1 − c) / (1 + e^{−Da(θ−b)}), where D is the constant 1.7    ...(2)

The curve based on these two models is the Item Characteristic Curve (ICC), which describes the relationship between the probability of successfully solving a specific item in the test and the examinee's ability (denoted θ in the function). There are three parameters in the above two models: a, b and c.

Parameter c is named the guessing parameter. As indicated below, c represents the lower limit of the ICC, meaning intuitively that c is the guessing probability: the probability of an examinee making a good guess even though his ability is extremely low, approaching negative infinity.

b is named the item difficulty. b is the value of θ at the point of maximum slope on the ICC. For an ICC with a lower limit of 0, b stands for the ability at which an examinee answers correctly with probability 0.5. A change in b leads to a shift of the ICC to the right or to the left without altering its shape; for example, a decrease in the value of b leads to a left shift of the ICC, meaning that the item becomes easier.

a is the item discrimination. The value of a/4 is the maximum value of the slope; at this point a minor change in ability leads to the largest change in P.

Figure 1. ICC Function

The model proposed by Rasch is given below:

P(θ) = e^{(θ−b)} / (1 + e^{(θ−b)})    ...(3)

Equation (3) is the dichotomous model presented, together with a multiplicative gamma model for reading speed, by Rasch (1960) in his monograph Probabilistic Models for Some Intelligence and Attainment Tests [8]. Some IRT researchers regard Rasch's model as a special case of the 3PN model with c = 0 and a fixed at 1. Others consider Rasch's model completely different and hold that it truly demonstrates the definition of measurement, because θ and b were defined respectively as the number of correct responses and the correct response rate for a specific item when the model was proposed. Besides, Rasch's model is more concise.

B. Computerized Adaptive Testing

In this research, CAT theory was applied in the CAT system, turning measurements into tailored tests. CAT is very different from traditional tests because it selects the most appropriate items for examinees based on their abilities or characteristics. If an examinee gives a right answer, a more difficult item is selected; on the other hand, if the examinee gives a wrong answer, an easier item is asked. Item Response Theory (IRT) provides the conceptual foundation for CAT.

In general, CAT procedures include three important parts: test starting and ending points, ability evaluation and item selection, and the result [9]. After determining the starting and ending points, the test begins with the first item; after receiving the answer, it evaluates the ability of the examinee and selects the most suitable question as the next item, until an ending condition is reached. The flow chart of item selection and ability evaluation is shown in Figure 2, and the discussion follows.

Figure 2. Flowchart of Computerized Adaptive Testing (begin with a provisional proficiency estimate; select and display the optimal test item; observe and evaluate the response; revise the proficiency estimate until the stopping rule is satisfied; then administer the next test or stop)

The selection of the starting item in CAT is very important because a suitable beginning item can decrease the length and duration of the test. There are three methods of determining the starting and ending points for common multiple-choice tests:
(1) Medium difficulty item: in general, examinees are at the medium level; thus it can be assumed that the ability of the examinee is of average degree, and the system can start by selecting medium difficulty items.
(2) Random selection: the computer randomly selects items whose difficulties are between -0.5 and +0.5.
(3) Examinee data: the computer determines the starting and ending points according to the examinee's age, intelligence, grades, characteristics, and other data.

In this research, the medium difficulty item was selected as the starting point. Items generated by the medium difficulty item generation rules were the first items in the CAT system.
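As a small worked example of the 3PL model in equation (2) (an illustrative sketch of our own, not code from the system described later), the probability of a correct response can be computed directly from the parameters a, b, c and the ability θ:

using System;

// Illustrative sketch: the 3PL item response function of equation (2).
static class ItemResponse
{
    const double D = 1.7;

    // Probability of a correct response given ability theta and item parameters a, b, c.
    static double ThreePL(double theta, double a, double b, double c) =>
        c + (1 - c) / (1 + Math.Exp(-D * a * (theta - b)));

    static void Main()
    {
        // A medium-difficulty item (b = 0) answered by an average examinee (theta = 0).
        Console.WriteLine(ThreePL(theta: 0.0, a: 1.0, b: 0.0, c: 0.2)); // 0.6 = c + (1 - c)/2
    }
}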

The step of ability estimation and item selection is a recursive process. After the examinee answers a question, his or her ability is re-estimated, and the next item is selected based on this estimate, until the ability evaluation is sufficiently accurate. The most commonly used ability estimation methods are MLE and the Bayesian model of IRT; the item selection strategies are maximum information strategies and Bayesian strategies [10].

MLE is easier in terms of ability estimation. It can estimate the examinee's ability accurately when the number of items is sufficient; however, if the examinee's responses are abnormal (e.g. all answers right or all wrong), it does not terminate [11]. The formula is:

θ_{m+1} = θ_m − [d ln L(u|θ)/dθ]_{θ_m} / [d² ln L(u|θ)/dθ²]_{θ_m}    ...(4)

θ: the ability of the examinee.
u: the response pattern of the examinee, where u = 1 means the item response is correct and u = 0 means an incorrect answer.

The Bayesian model assumes that the posterior probability of the ability is proportional to the product of the likelihood function and the prior ability distribution:

posterior ∝ likelihood × prior    ...(5)

It prevents the estimation from failing to terminate, but its efficiency is lower than that of MLE, and it has a regression effect, which may lead to bias [12].

In terms of the selection strategy, the maximum information strategy is commonly used. Since the amount of information and the test deviation are negatively correlated, the same item provides different amounts of information to examinees with different abilities, and different items provide different amounts of information to examinees with the same ability. Thus, the selection should be based on the ability of the examinee, and the item that provides the most information should be chosen as the next item. This is the principle of the maximum information strategy.

In CAT, different examinees have different test lengths. In general, there are three methods of ending the test:
(1) Set the maximum number of items, namely, preset the test length. After the examinee finishes the maximum number of items, the test is over.
(2) Set a minimum error standard: when the deviation of the examinee's ability estimate is lower than the minimum deviation, the ability estimation is stable, and the test ends.
(3) No more suitable items in the item bank: if none of the items can provide more information, an additional item contributes nothing to the ability estimation, and the test is over.

In this study, the Bayesian method was used to evaluate the ability, and the maximum information strategy was used to select the next item.
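The following sketch illustrates one Newton-Raphson update of the MLE in equation (4) and the maximum-information selection rule described above. It is our own illustration using the 2PL special case (c = 0), for which the item information has the familiar closed form D²a²P(1−P); it shows the MLE update rather than the Bayesian estimator actually adopted in this study.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: one MLE (Newton-Raphson) ability update as in equation (4),
// and item selection by maximum information, using the 2PL special case (c = 0).
static class AdaptiveStep
{
    const double D = 1.7;

    static double P(double theta, double a, double b) =>
        1.0 / (1.0 + Math.Exp(-D * a * (theta - b)));

    // Fisher information of an item at ability theta (2PL form).
    static double Information(double theta, double a, double b)
    {
        double p = P(theta, a, b);
        return D * D * a * a * p * (1 - p);
    }

    // One Newton-Raphson step: theta_{m+1} = theta_m - L'(theta)/L''(theta).
    static double UpdateTheta(double theta, IList<(double a, double b, int u)> answered)
    {
        double first = answered.Sum(i => D * i.a * (i.u - P(theta, i.a, i.b)));
        double second = -answered.Sum(i => Information(theta, i.a, i.b));
        return theta - first / second;
    }

    static void Main()
    {
        var answered = new List<(double a, double b, int u)> { (1.0, 0.0, 1), (1.2, 0.5, 0) };
        double theta = UpdateTheta(0.0, answered);

        // Select the unanswered item that yields the most information at the current estimate.
        var pool = new[] { (a: 0.8, b: -1.0), (a: 1.5, b: 0.2), (a: 1.0, b: 2.0) };
        var next = pool.OrderByDescending(i => Information(theta, i.a, i.b)).First();
        Console.WriteLine($"theta = {theta:F3}, next item difficulty = {next.b}");
    }
}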

C. Relevant Studies of Automatic Item Generation for Computer-based Figural Testing

Computer-based figural testing has been widely employed across various institutions, such as the Online Testing Center (http://www.onlinetest.org/), the center of Applied Psychology at Beijing Normal University (http://www.bnufr.com), and commercial web sites like IQTest (http://www.iqtest.dk/). These organizations provide useful computer-based figural testing tools and analysis tools for researchers. However, only online versions are provided.

Lin (2001) has researched computerized adaptive figural testing since 1998 [13]. His research is based on an analysis of the structure of Raven's Advanced Progressive Matrices (APM), and he was responsible for the development of the New Figure Reasoning Test (NFRT). NFRT contains two main systems: the automatic item-generation system and the online testing system. The online testing system, based on IRT theory, is just an interface for collecting responses and evaluating the ability of examinees. The focus of this study is the automatic item-generation system, which is discussed in the following paragraphs.

An automatic item-generation system contains an item generation algorithm and an item-generation engine based on APM. The functions, strengths and restrictions of this system are described as follows:
(1) Item generation engine: The engine can automatically generate a specific item with particular content features, and combine different types of geometric figures in a systematic fashion to produce and measure items that match the goal. The purpose of the measurement is to evaluate examinees' reasoning ability in conclusion (inference on relations) and deduction (inference of relativity) through the figure partition characteristics of the item and the manipulation of the relationships between figures in space. An example item of APM is shown in Figure 3.

Figure 3. An example item of APM

(2) Item generation algorithm: The algorithm for item generation is based on an analysis of the features of APM items. The key points are the parameters of IRT theory and the problem-solving processes of APM.

The IRT parameters of APM are discussed as follows:
(1) Difficulty: According to Hambleton and Swaminathan (1985), the value of the item difficulty parameter should lie between -2.0 and 2.0. Based on this criterion [4], the average difficulty of APM items was -0.868, which is within -2.0 to 2.0.
(2) Discrimination: In terms of ability tests, the value of the discrimination parameter was greater than 0 but relatively low in APM, and item 8 had the lowest discrimination (0.014).
(3) Guessing: According to the estimation, the guessing value in APM items was 0.219. Since there are 8 choices in APM, the predicted value should be 12.5%; the average guessing value was thus higher than expected.

D. Relevant Studies of Selection Verification

In selection verification, what test administrators care most about is the accuracy and the difficulty level of the options. Verifying a figural test is much more difficult than verifying a text test. As multimedia science becomes more and more developed, this study employs content-based image retrieval technology for selection verification. The related techniques are as follows:

(1) Function without color characteristics:
A simple eigenvector f_i can be used to represent the figure when no color features are considered:

f_i = (i1, i2, i3, ..., in)    ...(6)

f_i is the eigenvector of figure i, and n is the number of content features. The similarity level of two figures is calculated as the Euclidean distance of their eigenvectors (as shown in function 7). The closer the value is to 0, the higher the similarity of the two figures; the greater the value, the lower the similarity.

d(Q, I) = √( Σ_{j=1}^{n} (f_j^Q − f_j^I)² )    ...(7)

(2) Function that considers the color features:
If colors need to be considered, then other methods must be employed. Mehtre, Kankanhalli and Lee (1998) proposed to consider the two eigenvalues of figure color and shape together to calculate the similarity level of figures, using logo comparison as the study subject. The steps of the proposed method are as follows:

(I) Look for the color clusters in the figure. The calculation of the color distance is shown as function (8). When clustering the colors of the 400×400 figures used in the experiment, the minimum color distance threshold between clusters is set to 50.

Color distance = √( (ΔR)² + (ΔG)² + (ΔB)² )    ...(8)

(II) Look for the shape clusters in the figure. First divide the classified color clusters into several layers according to step (I); the number of color clusters is the number of layers. Mark the shape clusters of each layer's figure, and sort the shape clusters of each color layer in descending order according to the pixel count of each shape cluster. When a shape cluster contains fewer than 50 pixels, it can be ignored. In addition, to avoid falsely determining a thin line to be a cluster, the minimum density value of the shape cluster (see function 9) is set as a threshold for the various shapes; if a cluster's density is smaller than this value, the shape cluster is ignored.

density = (population of cluster) / (l_max)²    ...(9)

where l_max = max(|x2 − x1|, |y2 − y1|), and (x1, y1) and (x2, y2) are the corner points of the shape cluster.

(III) Similarity level calculation: Calculate the similarity levels of color and shape respectively according to the color and shape distance functions (see functions 10 and 11), and then calculate the similarity level of the two integrated features according to function 12.

coldis(C_i^Q, C_j^I) = √( (R_i^Q − R_j^I)² + (G_i^Q − G_j^I)² + (B_i^Q − B_j^I)² )    ...(10)

Figure Q has m color clusters and p shape clusters; figure I has n color clusters and q shape clusters.

shpdis(C_i^Q, C_j^I) = √( Σ_{k=1}^{7} (m_k^Q − m_k^I)² )    ...(11)

where the m_k are the moment invariants.

D(Q, I) = ω1·Δ1 + ω2·Δ2 + ω3·Δ3 + ω4·Δ4    ...(12)

Δ1 = Σ_{i=1}^{max(m,n)} coldis(C_{c,i}^Q, C_{c,Pc(i)}^I)
Δ2 = Σ_{i=1}^{max(m,n)} (λ_{c,i}^Q − λ_{c,Pc(i)}^I)²
Δ3 = Σ_{i=1}^{max(p,q)} shpdis(C_{s,i}^Q, C_{s,Ps(i)}^I)
Δ4 = Σ_{i=1}^{max(p,q)} (λ_{s,i}^Q − λ_{s,Ps(i)}^I)²

ω1, ω2, ω3 and ω4 are the weighting indices. Pc is the closest color cluster assignment function: it maps every color cluster i of image Q to the closest color cluster Pc(i) of image I, and Ps does the same for the shape clusters.
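A compact illustration of functions (7) and (8) follows (our own sketch, not the system's code): the Euclidean distance between two feature vectors, and the RGB color distance used when clustering colors with a threshold of 50.

using System;
using System.Linq;

// Illustrative sketch of the similarity measures in functions (7) and (8).
static class ImageSimilarity
{
    // Function (7): Euclidean distance between two eigenvectors; closer to 0 means more similar.
    static double FeatureDistance(double[] fQ, double[] fI) =>
        Math.Sqrt(fQ.Zip(fI, (q, i) => (q - i) * (q - i)).Sum());

    // Function (8): color distance between two RGB values, used with a clustering threshold of 50.
    static double ColorDistance((int R, int G, int B) c1, (int R, int G, int B) c2) =>
        Math.Sqrt(Math.Pow(c1.R - c2.R, 2) + Math.Pow(c1.G - c2.G, 2) + Math.Pow(c1.B - c2.B, 2));

    static void Main()
    {
        double d = FeatureDistance(new[] { 1.0, 0.0, 3.0 }, new[] { 1.0, 2.0, 3.0 });
        bool sameCluster = ColorDistance((255, 0, 0), (250, 10, 5)) < 50; // below the threshold
        Console.WriteLine($"d = {d}, same color cluster = {sameCluster}");
    }
}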

III. METHODS

The objective of this study is to propose a new concept, the VIB, and to show how this concept is used in CAT. The following is a discussion of the problems and demands of item bank generation that we encountered, followed by the development of the research tools.

A. Problems and Demands of Item Bank Generation

An item bank consists of calibrated, analyzed, categorized, and evaluated items. Millman and Arter (1984) believed that item banks would become more computerized in the future. An IRT-based item bank was estimated to have the following advantages:
(1) It allows test editors to edit items for all purposes without any restraints.
(2) It allows test editors to edit tests with the proper number of items within the range of the item bank [15].

Thus, an item bank has the potential to improve test quality. However, we often face the following problems while building an item bank:
(1) Number of items: In general, it is better to have more items, but it should also be considered whether the items' quality meets the test editors' requirements and achieves the purpose of the test. Researchers have suggested that every concept must include 10 items, and every course unit has to contain 50 items. Reckase (1981) recommended 100 to 200 items with evenly distributed difficulty parameters and adequate discrimination parameters; if this standard can be reached, the bank can be used for computerized adaptive testing [16].
(2) Categories of the item bank: The most common categorization uses the theme or instructional goal; the other uses key words for searching. In general, using key words is more flexible and can serve particular purposes, content, ages, and thinking styles.
(3) Scaling parameters of items: Scaling parameters are designed to calibrate item parameters such as difficulty and convert them to the same scale. For tests on a large sample, scaling parameters are necessary; however, they can be omitted for individual tests.
(4) The problem of public access: It may seem that teaching would be limited to the content of the item bank if teachers could freely use the item bank as an assessment tool. But if the item bank is large enough, this problem can be ignored, because teachers are then unable to limit their teaching to the item bank content. On the other hand, if the item bank is not large enough, opening the item bank may narrow the focus of teaching. Thus, whether the item bank should be open must be considered. Still, opening a few item samples can help both teachers and students to understand the testing method, which is both necessary and appropriate.
(5) Security problems of the item bank: Item banks make test editing and scoring easier; however, repeated use of the item bank can interfere with item security (for example, through the reappearance of old items). This has to be taken into consideration if the item bank is small; on the contrary, this concern can be ignored if the item bank is large enough. In addition, updating the items constantly to ensure content validity and statistical quality is another way to ensure item bank security.

Based on these considerations, we found that a test with a sufficient number of items can be helpful for quality, security, and teaching. Thus, this study set out to design a new concept for an item bank containing an abundant number of items of fair quality to solve the problems mentioned above.

B. Development of the Research Tools

This research developed two research tools: the Virtual Item Bank System and the CAT system. The system structure and functions of these two tools are described as follows.

(1) Virtual Item Bank System (VIBS): In the VIBS, the item database no longer stores large numbers of items; instead, it stores two kinds of elements that replace the traditional items:
(I) Basic figure objects: The system no longer needs to save a large number of figural items. Instead, items are built from three basic figure types: line, circle and polygon. Not only does this lower the memory space requirements, but it also reduces the probability of item exposure.
(II) Processes: The examinees' solving processes and the abilities involved are defined by specialists and converted to mathematical formulas which can be manipulated by computers and stored in the hypothetical item database. Using these data along with the basic figure objects, the computer can mass-produce items and lower the workload of test preparation.

The VIB which replaces the traditional item bank is illustrated by the flow chart below (Figure 4): identify the ability to be tested; analyze the ability and the processes the subject and the item require; convert the ability needed by the subject into mathematical functions and store them in the database; the system then analyzes the subject's testing needs and produces items according to the formulas in the database.

Figure 4. Flow chart of the VIB

The VIBS contains three subsystems: the item rule definition subsystem, the item generation subsystem, and the answer retrieval subsystem. Each subsystem has different tasks and functions, described below.

(1) Item rule definition subsystem:
This subsystem provides test editors with a number of figural objects and processes the information needed to solve the problem. Through the system interface, users can determine the figures' positions on the system's interface and choose the method of processing the images. The subsystem then estimates the item difficulty and asks the test editors to adjust the difficulty level. Finally, the item rule definition subsystem saves this information into the database.

Referring to the conditions of parameter estimation (with respect to the IRT parameters), test editors need to consider the examinees' experience, required ability, age and other factors, since these are some of the strands that can affect item difficulty.

This study aims to help test editors define the difficulty level of items automatically and lessen their burden. We have simplified most factors while deriving the parameters and analyzing the number of objects needed (items needed) and the image processes; other factors will be analyzed in later sections.

In terms of parameter estimation, there are three methods for different parameter conditions:
(I) If the item parameters are already known and only the ability parameter needs to be estimated: the MLE and Bayesian procedures are commonly used [17].
(II) If the ability parameter is known, but the item parameters need to be estimated: MLE and the Bayesian procedure are used [17].
(III) If the item and ability parameters are both unknown: Joint Maximum Likelihood Estimation (JMLE), Marginal Maximum Likelihood Estimation (MMLE), Bayesian modal or Maximum a Posteriori estimation (MAP), and Bayesian mean or Expected a Posteriori estimation (EAP) are used to estimate the item and ability parameters [18].

In this study, the ability parameter was the one estimated, since the ability parameters were unknown.

(2) Item generation subsystem:
The main function of this subsystem is to generate all kinds of data for item generation, in the hope of producing an unlimited number of items. Its main functions are:
(I) Defining the abilities and strategies needed to solve the item.
(II) Determining the object shapes of each item.
(III) Identifying the difficulty parameter.
(IV) Parameter conversion: the system converts the data mentioned above into mathematical formulas and saves them in the VIB.
(V) Automatic generation: the item generation subsystem can automatically generate items according to the defined strategy, difficulty level, and options.

(3) Answer retrieval subsystem:
The alternative options of each item are generated by image comparison. First, we compute the RGB values of the figure's pixels as the characteristic values. Then, we save the figure characteristics into a two-dimensional matrix and compare them with the figures in the database. The similarity of two figures is calculated as the Euclidean distance of the characteristic values (as shown in function 13), and we select the three figures with the lowest distances as the alternative options.

d(Q, I) = √( Σ (f^Q − f^I)² )    ...(13)

The VIBS is composed of these subsystems, which control the item shape, item difficulty, answer, and all parameters.

C. System Interface

The system interface and functions are as follows:

(1) Decide the location of the figural objects.

Figure 5. The interface for deciding the location

(2) Decide the processes applied to the objects.

Figure 6. The interface for deciding the rules (processes)

(3) Choose the next figural objects and save them into the VIB.

Figure 7. The interface of the save function

According to the process and element definitions above, the system saves them as an XML file, and when the VIBS is read by the CAT component, the corresponding tests can be generated.
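To make the storage step concrete, the sketch below (illustrative only; the class shape and the resulting XML layout are our assumptions, not the actual VIBS schema) serializes an item rule, consisting of basic figure objects plus an image-processing operation and a difficulty value, to an XML file with the standard .NET XmlSerializer.

using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Illustrative sketch: persisting an item rule (objects + process + difficulty) as XML.
public class FigureObject
{
    public string Shape;          // "line", "circle" or "polygon"
    public int X, Y;              // position chosen through the interface
}

public class ItemRule
{
    public List<FigureObject> Objects = new List<FigureObject>();
    public string Operation;      // e.g. "Xor", one of the image process operations
    public double Difficulty;     // IRT difficulty parameter in [-3, +3]
}

static class VirtualItemBankStore
{
    static void Main()
    {
        var rule = new ItemRule
        {
            Objects = { new FigureObject { Shape = "circle", X = 120, Y = 80 },
                        new FigureObject { Shape = "polygon", X = 200, Y = 80 } },
            Operation = "Xor",
            Difficulty = 0.5
        };

        var serializer = new XmlSerializer(typeof(ItemRule));
        using (var writer = new StreamWriter("item-rule.xml"))
            serializer.Serialize(writer, rule);   // read back later by the CAT component
    }
}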

Figure 8. The demo of a computer figural test with VIBs.
Figure 9. The demo of a Four Arithmetic Operations Test with VIBs.
Figure 10. The demo of a Cube Counting Test with VIBs.
The figures above represent the issues of the problem and the demands of item bank generation, in addition to the development of the research tools. The research tools helped test editors to solve the problem of the item exposure rate. A simulation of the item overlap rate will be discussed and proved in the following section.

IV. RESULTS

A. System implementation
The study system employs the Internet 3-tier (Browser-WEB-Application) client/server architecture according to the study purpose and the system analysis and design. The following respectively describe the tools and technologies adopted in the system implementation, database design, and system program verification:
(1) Development tool technology: C#, XML.
(2) Database design architecture: The system adopted the XML file format as the backend database. Automatically generated test questions and the functions of adding, searching, amending, and deleting related information can be achieved through operations on the XML format. The XML format can be coupled with various server operating systems and Web servers, and is suitable as a system backend storage tool.
(3) System Algorithm: The following are the algorithms for the Item Initial System and the Item Generation System.
(I) Item Rule Definition subsystem
This subsystem is mainly based on the binary operations of image processing theory. When the test editor defines the figural objects' locations and the image processing operations, the system generates the result and stores it into the Virtual Item Bank. The pseudo code for the system algorithm is as follows:

INPUT:
1. define objects as {line, circle, polygon}
2. n_i[x_i..x_j, y_i..y_j], where n belongs to {line, circle, polygon}, i belongs to {1..10}, and x_i, x_j, y_i, y_j belong to {1..7500 (pixels)}
3. O_i belongs to the image process operations {Or, And, Xor, Sub, Color, Size}
4. P_i belongs to the difficulty parameters {-3..+3}
OUTPUT:
1. r_i[x_i..x_j, y_i..y_j], where r belongs to {line, circle, polygon}, i belongs to {1..10}, and x_i, x_j, y_i, y_j belong to {1..2500 (pixels)}
2. F[i], the item generation function, which is stored in the Virtual Item Bank.
STEPS:
Get the location of the normal figures n_j(x_i, y_i)
Get the image process operations o_i
Get the location of the figures which will join the image process n_k(x_i, y_i)
Get the difficulty parameter p_i
For every j belonging to n:
  For every i belonging to o:
    r_j = n_j o_i n_k
    function: F(j) = r_j and p_i
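As an illustration of the control flow of the Item Rule Definition pseudo code above, the minimal Java sketch below enumerates every figure object and every image-process operation, records the combination with the joined figure, and keeps the difficulty parameter with it. The object, operation and bank types are assumptions introduced only for this sketch (the actual tool was developed in C#).

import java.util.ArrayList;
import java.util.List;

public class ItemRuleDefinitionSketch {

    enum Op { OR, AND, XOR, SUB, COLOR, SIZE }        // image-process operations O_i

    record FigureObject(String shape, int x, int y) {}                   // a normal figure n_j
    record GeneratedItem(FigureObject base, Op op, FigureObject joined,
                         int difficulty) {}                               // F(j) = r_j with p_i

    // For every figure and every operation, combine it with the joined figure n_k
    // and record the generation rule together with its difficulty parameter p_i.
    static List<GeneratedItem> defineRules(List<FigureObject> figures,
                                           FigureObject joined,
                                           int difficulty) {
        List<GeneratedItem> bank = new ArrayList<>();  // stands in for the Virtual Item Bank
        for (FigureObject n : figures) {
            for (Op o : Op.values()) {
                bank.add(new GeneratedItem(n, o, joined, difficulty));
            }
        }
        return bank;
    }
}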


(II) Item Generation System

INPUT: The locations of the figure objects, difficulty parameters.
OUTPUT: Figural items.
STEPS: (initializations)
P_i = difficulty parameter of Item_i
r_i = location of the final figures
Randomly select a type of polygon S_i
If S_i has ever been selected then
Begin
  Record the type of S_i
  Randomly select another S_i
End
Randomly select an item shape from Rule
Randomly select a direction belonging to (Top to Down, Down to Top, Right to Left, Left to Right)
Generate the item
Generate the answer
Do the answer Image Data Retrieval
Get the perfect answer

B. Test Security
In this study, an item overlap simulation was conducted. According to the item overlap rate (given in formula (14)), when the max length of the test = 12, subjects = 30000, number of objects = 1, and processes of the item generation = 12, the simulation results are as follows.

R_t = Σ_{i=1}^{N} TO / ( C(N,2) · Σ_{i=1}^{N} L_i ) = 2 Σ_{i=1}^{N} TO / ( N(N−1) Σ_{i=1}^{N} L_i )    (14)

R_t : the test overlap percentage
TO : the total number of items that both subjects overlap
L_i : the test length of the i-th subject

Table 1. Results of the item overlap rate simulation
Item overlap rate (R)      1.714321 × 10^-10
Mean of test length        9.3012
Mean of Theta-Estimated    -0.134
Mean of SE                 0.3017

Table 2. Use frequency of each item-generation rule
Rule  frequency    Rule  frequency
1     19321        7     18765
2     23012        8     17862
3     17632        9     19122
4     18453        10    17280
5     19865        11    22009
6     20121        12    21776

Table 3. Item overlap frequency (times) of each rule
Rule  frequency    Rule  frequency
1     0            7     0
2     0            8     0
3     0            9     1
4     1            10    0
5     1            11    0
6     0            12    0

The simulation results proved that the VIBS solves the problems of item exposure.

V. DISCUSSION AND CONCLUSION
From the results of the item overlap simulation, it is obvious that the VIBs can resolve the problem of item exposure efficiently. Every examinee got different items on the same test. This allows the VIB to be used not only in measurement but also in practice. The results of the experiment showed its evident effects in practice.
In the VIB, the items were generated dynamically. It was, however, difficult to apply this in the CAT system. In order to solve this problem, two CBT testing systems were designed to collect the item difficulty parameters of the item generation rules.
The study has also encountered some problems. For example, in study tool development, some test designers think that it is difficult to operate, and that the method of some questions cannot be correctly entered into the system, such as pentominoes. Some problem solving and test combination methods are extremely complicated, and the human and material resources cost to input them into the system is even more than designing the test, such as English grammar tests. Moreover, in difficulty estimation, some test designers also found that difficulty estimation is difficult. Some question types are similar, but the difficulty level is completely different, and therefore the difficulty consideration is still insufficient.
In addition, some test subjects raised the issue of the distracters being too difficult. Because some distracters are generated through Image Data Retrieval, some distracters are too similar and result in the test subject making an incorrect decision, further impacting their score. In addition, the change between questions is sometimes very small, which also easily results in the test subjects giving incorrect answers. Finally, some test subjects proposed the recommendation that, because the VIB will almost never generate question exposure, the VIB can be employed in a practice system, and after a large amount of practice, some test subjects' scores will make significant progress.


REFERENCES
[1] Yang, H. L., & Ying, M. H. (2005). Could On-line Testing have the Same Effects on Scoring as Paper-and-Pencil Testing? Journal of Taiwan Normal University: Mathematics & Science Education, 50(2), pp. 85-107.
[2] Ho, R. G., & Hsu, T. C. (1989). A comparison of three adaptive testing strategies using MicroCAT. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
[3] Cronbach, L. J. (1990). Essentials of Psychological Testing. New York: Harper Collins Publishers.
[4] Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.
[5] van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a CAT item pool as a set of linear test forms. Journal of Educational and Behavioral Statistics, 31, pp. 81-100.
[6] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.
[7] Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
[8] Jansen, M. G. H. (2003). Estimating the parameters of a structural model for the latent traits in Rasch's model for speed tests. Applied Psychological Measurement, 27(2), pp. 138-151.
[9] Wainer, H. et al. (Eds.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
[10] Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.
[11] Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
[12] Ho, R.-G. (1989). Development and implementation of the CAI software database and interactive evaluation system. Paper presented at the 1st 1989 International CAI Conference, Taipei, Taiwan, ROC. (Invited Speech)
[13] Liu, Z. J., Liang, R. K., & Lin, S. H. (2001). Automatic Item-Generation and Online Testing System for New Figure Reasoning Test. 5th Global Chinese Conference on Computers in Education / International Conference on Computer-Assisted Instruction 2001, pp. 326-333.
[14] Mehtre, B. M., Kankanhalli, M. S., & Lee, W. F. (1998). Content-based image retrieval using composite color-shape approach. Information Processing & Management, 34(1), pp. 109-120.
[15] Mehtre, B. M., Kankanhalli, M. S., & Lee, W. F. (1998). Content-based image retrieval using composite colour-shape approach. Information Processing & Management, 34(1), pp. 109-120.
[16] Reckase, M. D. (1981). Tailored testing, measurement problems and latent trait theory. Paper presented at the annual meeting of the National Council for Measurement in Education, Los Angeles.
[17] Baker, F. B. (1992). Item response theory: Parameter estimation techniques. NY: Marcel Dekker.
[18] Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.


An Approach to Automated Runtime Verification


for Timed Systems: Applications to Web
Services
Tien-Dung Cao 1, Richard Castanet 2, Patrick Felix 2, and Kevin Chiew 1
1 School of Engineering, Tan Tao University, Duc Hoa District, Long An Province, Vietnam. Email: {dung.cao, kevin.chiew}@ttu.edu.vn
2 LaBRI - CNRS - UMR 5800, University of Bordeaux, 351 cours de la liberation, 33405 Talence cedex, France. Email: {castanet, felix}@labri.fr

Abstract—Software testing plays an important role in verifying and assessing the quality of a software application. There are various testing approaches proposed for different application scenarios. In this paper, we propose a new passive testing approach to verifying a timed trace with respect to a set of constraints. With the extension of the Nomad language, we are able to formally describe all constraints and combine conditions by the logical operations AND and OR into expressions. By well organizing and evaluating the expressions, we are able to carry out runtime verification message by message in a timed trace. In addition to the theoretical framework, we have also developed a software tool known as RV4WS (Runtime Verification for Web Services) for the automation of our testing approach, and implemented all algorithms in the paper with this tool. We conduct a case study of web service composition to verify the effectiveness of our approach and tool.

Keywords—Runtime verification, Passive testing, Rule specification, Web services.

This paper is based on "Automated Runtime Verification for Web Services," by T.-D. Cao, T.-T. Phan-Quang, P. Felix, and R. Castanet, published in the Proceedings of the IEEE International Conference on Web Services, Miami, FL, USA, July 5-10, 2010. (c) 2010 IEEE.
This research is partly supported by the French National Agency of Research within the WebMov Project, http://webmov.lri.fr.
* Corresponding author.
doi:10.4304/jsw.7.6.1338-1350

I. INTRODUCTION
The activity of conformance testing focuses on verifying the conformity of a given implementation based on its specification. It can be classified into two categories, namely active testing and passive testing. (a) Active testing requires a tester to interact directly with the implementation under test (in short, IUT) and check the correctness of the answers given by the implementation. However, this method is not applicable to a running system due to some reasons like (1) testers do not have permission to access the interface of a running system; (2) if testers use the active method to test functions like create_new_account(...) or update_debit(...) of a banking system, it may incur errors like false accounts or updates in the database of the system; and (3) active testing does not allow us to check several security properties of a system that can only be captured at runtime or when several sessions are executed in parallel. Moreover, because testing cannot find all faults, even if a system has passed an active test, we still need to verify its conformity at running time or to analyze its log files for improving the reliability of a system. (b) Passive testing collects the observable traces (i.e., the log files) of the running system by installing a probe and analyzes them based on a set of rules [8], [9], [21] or a formal specification [12]. Without a tester directly interacting with the IUT, passive testing does not affect the system running, and is widely adopted for system verification.
Passive testing can be carried out either on-line or off-line. The on-line technique, a.k.a. the runtime verification technique, immediately checks an observable trace once an input/output event occurs, so that potential damage can be prevented by terminating the system running whenever any fault is detected; whilst the off-line technique checks an observable trace after it is collected for a period of time, and does not usually require additional resources such as CPU, RAM or another computer to run both the trace collection engine and the checking engine in parallel.
For a complex system such as an SOA application or a cloud computing application, the communications across system components are carried out by signals, events, and messages, whose timed traces may be collected from a distributed environment and need to be well synchronized during verification. Therefore, we suggest addressing the following factors when we define a set of constraints to verify a timed system.
- Time constraint. The passive testing verifies the message sequence in a trace. However, when a system is running, we do not know when the next message will arrive after the previous one. Thus we may have to set a time constraint for each message. For example, we can set a time constraint of 10 seconds for receiving a loginResponse after sending a loginRequest.
- Condition on message content. Sometimes we are only interested in some messages of which the contents satisfy some conditions. For example, we can identify the messages sent to or received from machine A by their contents (SourceIP = A) or

(DestIP = A).
- Data correlation. For any observable trace mixed by several traces or sessions that are executed in parallel, we need to apply our constraints on the messages that belong to an individual trace or session. To do so, we firstly find the messages that have a correlation by their data values (known as data correlation), and then apply our constraints on these messages. For example, we can assign sessionId fields to messages belonging to different sessions running in parallel, and can group these messages by the values of the sessionId field before applying our rules for correctness verification.
- Combination of conditions. A constraint can also be represented by a combination of several conditions with logical operations such as AND, OR, and NOT.
In this paper we propose a new approach to passive testing, either on-line or off-line, for a timed system by verifying a timed trace based on a set of rules which contains the constraints on the message sequence, the interval time between any two messages, and the contents of messages. To formally describe constraints for specifying permissions and prohibitions, we propose to extend the Nomad [14] language by defining the constraints on each atomic action (fixed conditions) and a set of data correlations between the actions, so as to describe permissions and prohibitions, both of which are atomic actions and should be applied immediately, and obligations, which are related to non-atomic actions within contexts and need a time duration to complete. For example, let x be a positive integer; a prohibition or a permission rule is evaluated to be true at time t if t ∈ [0, x], whereas an obligation rule is evaluated to be true at time t if t > x, meaning that the obligation needs at least a duration x to complete the work. Besides the theoretical framework, we develop a software tool known as RV4WS (Runtime Verification for Web Services) to implement the automation of our passive testing approach. In particular, the algorithms presented in this paper are fully implemented by this software tool. We also apply our tool to a case study of the WebMov project (http://webmov.lri.fr), which provides design and composition mechanisms for web services.
The remaining sections are organized as follows. We first present some discussions about software testing and existing methods for passive testing or runtime verification in Section II, and then introduce the syntax and semantics of our rules in Section III, followed by an algorithm for verifying a timed system based on a set of rules in Section IV. In Section V, we introduce the RV4WS tool, together with a case study in Section VI, before concluding the paper in Section VII.

II. DISCUSSION

A. Software Testing
Testing is an important step to verify and assess the quality of a software application, and an appropriate testing type should be chosen for an individual application. We classify the types of testing into four categories based on the characteristics of the application, the phases of the development, the available information of specifications and the capability of application controls, and use a schema with four axes to show the classification as depicted in Figure 1.

Fig. 1. Classification of testing types

1) The characteristics:
- Conformance testing. It is used to test the conformance of an implementation based on its specification.
- Robustness testing. It is used to test the capability to deal with unexpected data.
- Performance testing. It refers to the assessment of the performance of an application in different cases in terms of speed and effectiveness.
- Security testing. It is a process to determine that an information system protects data and maintains functionality as intended. Some security concepts that need to be covered by security testing are listed as follows.
  - Authentication, which is the process of establishing the identification of a user.
  - Authorization, which is the process of determining that a requester is allowed to receive a service or perform an operation.
  - Availability, which is to assure that information and communications services are ready for use upon request, or that the information is kept available to authorized users when they need it.
  - Integrity, which is a measure by which receivers can determine the correctness of the information provided by the system.
- Reliability testing. It evaluates the good functioning under different conditions such as timing constraints, speed of network, etc.
2) The phases of the development:
- Unit testing. It is to verify the operation of an individual component or module in isolation from the rest of the system.
- Integrated testing. It is to test the interactions


amongst components of a system. In other words, it tests a system at the interface level of each component.
- System testing. It is to verify the global behavior of the system.
3) The accessibility:
- Black-box testing. It allows testers to generate test cases from system specifications for functional testing without knowing the internal structure of the system.
- Gray-box testing. It is used when some information of the internal structure is available for testers.
- White-box testing. It is used when testers know the internal structure of the system (i.e., the code), and allows testers to verify the structure by testing different paths in the code.
4) The controllability:
- Active testing. It allows testers to interact directly with the system under test by sending requests and receiving responses for analysis.
- Passive testing. It allows testers to assess a system from input/output events or log files without interacting with the system under test.

B. Passive Testing of Systems
Because it has no side-effects on the system, passive testing is usually used as a monitoring technique to detect and report errors when we cannot use an active testing method. Another area of its applications is in network management for the detection of configuration problems, fault identification, or resource provisioning. This section reviews some passive testing approaches.
Bayse et al. [9] and Cavalli et al. [12] proposed a passive testing approach based on invariants of a Finite State Machine (FSM). They defined the following two types of invariants for an FSM M = (S, s_in, I, O, T), where S is a set of finite states, s_in an initial state, I the set of input actions, O the set of output actions, and T the set of transitions.
- Simple invariant. A trace i_1/o_1, i_2/o_2, ..., i_{n-1}/o_{n-1}, i_n/O' is a simple invariant of the FSM M given that we necessarily get an output o ∈ O', where O' ⊆ O, if we obtain the input i_n under the premise that each time the trace i_1/o_1, i_2/o_2, ..., i_{n-1}/o_{n-1} is observed.
- Obligation invariant. It is used to express properties such as "if y happens then we must have that x had happened".
They presented two algorithms to check a finite trace from left-to-right and from right-to-left to give a verdict, without considering the time constraints on the traces. TIPS [11] (Test Invariant of Protocols and Services) is an implementation tool of this approach.
To express temporal properties, Andres et al. [1]-[3] introduced the Timed Invariant as an extension of the simple invariant with time constraints between an input and an output. There are some limitations of their Timed Invariant model, as listed below.
- It only supports future time, not past time. This is because its semantics is defined as: if we obtain a trace of pairs of input/output events (the interval time between an input and an output is also considered) and we continue to obtain an input (after this trace), then we must obtain an output after a fixed interval of time.
- It does not support combining several conditions into a Timed Invariant by logical operations such as AND and OR.
- It does not consider constraints on the content of each event; therefore the data correlation problem between the events is also not considered.
- Finally, the tool PasTe [1], which is implemented to check the correctness of a log w.r.t. a set of timed invariants, does not allow us to verify an execution trace in parallel with the trace collection engine, i.e., it does not support runtime verification or on-line checking.
Mallouli et al. [16] proposed security rules using the Nomad language to express the constraints on a trace with obligations, prohibitions and permissions. That is, a prohibition or permission rule is granted and applied immediately to a trace, while an obligation rule delimits the completion deadline of a task. They also introduced an algorithm to check the correctness of a trace following these security rules. Although their approach addresses the time constraints missing from the invariant approach, it does not consider the correlation of messages by their data values, which is an important issue for passive testing.
Tabourier and Cavalli [22] proposed an approach to verify whether the traces actually belong to the accepted specifications provided by an FSM. This method is composed of two stages:
- Firstly, a passive homing sequence is applied to determine the current state. Initially, all states are put into a candidate list. When an input/output arrives, the current state will be updated by the destination state of the corresponding transition if it is the source state of the transition, or otherwise removed from the candidate list. After a number of iterations, either a single current state is obtained and we move to the second step to detect the fault, or an input/output pair is not accepted by any candidate state. In the latter case, a fault is detected.
- Secondly, fault detection is carried out by applying the search technique to the current state and the current input/output pair. If a state which does not accept the following transition is reached, then there is an error; otherwise, when the end of the trace is reached, no error is detected.
This method does not consider the time constraints on the traces and is not applicable to the case where the trace is collected from the execution of multiple sessions that run in parallel.


C. Passive Testing of Web Services
In recent years, many methods have been proposed, together with tools developed, for the passive testing of web services [4]-[6], [13], [17], [20]. These works focus on either checking the order of messages and/or their occurrence times on a trace file to give a verdict [13], [20], [21], or proposing a method for dynamic statistics [4], [6] of some properties of web services.
Dranidis et al. [17] proposed the utilization of Stream X-machines for constructing formal behavioral specifications of web services. They also presented a runtime monitoring and verification architecture and discussed how it can be integrated into different types of service-oriented infrastructures. However, they did not present an algorithm or a tool to verify an execution trace using the Stream X-machines specification of web services.
Baresi et al. [4], [5] presented a monitoring framework for BPEL orchestration which is obtained by integrating two approaches, namely Dynamo and Astro, which are used for dynamic statistics of some properties of BPEL processes from single or multiple instances. These works focus on the behavioral properties of composition processes expressed in BPEL rather than on individual web services. Moreover, an assessment (a verdict true/false) about a service is not considered in this work.
Cavalli et al. [13] proposed a trace collection mechanism for SOA by integrating modules within a BPEL engine, and a tool [13], [16] that checks off-line execution traces. Although this approach uses the Nomad [14] language to define the security rules, it does not allow us to check in real time (i.e., on-line) whenever a message happens. Moreover, this work does not consider the data correlation between the messages in the rules.
Li et al. [20], [21] presented pattern and scope operators as the rule base to define the interaction constraints of web services. The authors use an FSM as the semantic representation of interaction constraints. In this approach, the validation process runs in parallel with the trace collection. This approach is limited by the number of patterns, and it does not consider the time constraints.

III. RULE DEFINITION

A. Syntax
In our work, we consider each message as an atomic action, and use one or several messages to define a formula with the logical operations AND and OR. We also use the operation NOT to indicate that a message is not permitted to appear in the trace within a duration. During the formula definition, constraints on the values of message parameters may be considered. Finally, from these formulas, the rule is defined in two parts, namely the supposition (or condition) and the context. A set of data correlations is included as an option.

Definition 1. Atomic action. An atomic action is either an input message or an output message, formally denoted as
AA := Event(Const) | ¬AA
where
- Event represents an input/output message name;
- Const := P ⊗ V | Const ∧ Const | Const ∨ Const, where
  - P are the parameters; these parameters represent the relevant fields in the message;
  - V are the possible parameter values;
  - ⊗ ∈ {=, ≠, <, >, ≤, ≥};
- ¬A means not(A).

Definition 2. Formula. A formula is recursively defined as
F := start(A) | done(A) | F ∧ F | F ∨ F | O^{d[m,n]} F
where
- A is an atomic action;
- start(A): A is being started;
- done(A): A has been finished;
- O^{d[m,n]} F: F was true d units of time ago if m > n, and F will be true in the next d units of time if m < n, where m and n are natural numbers.

Definition 3. Data correlation. A data correlation is a set of parameters that have the same data type, where each different parameter represents a relevant field in a different message, and for which the operator = (equal) is used to compare the equality amongst parameters. A data correlation is considered as a property on data.

Example 1. Let A(p_0^A, p_1^A), B(p_0^B, p_1^B, p_2^B) and C(p_0^C) be messages with p_i the parameters, where p_0^A, p_0^B and p_0^C have the same data type. A data correlation set that is defined based on A, B and C is {p_0^A, p_0^B, p_0^C}, i.e. {p_0^A = p_0^B = p_0^C}.

By putting the time constraints into an interval, we support only two types of rules, namely permission and forbidden. Permission means that all traces must satisfy the constraints, whereas forbidden is the negation of a permission constraint.

Definition 4. Rule with data correlation. Let α and β be formulas, and let CS be a set of data correlations based on α and β (CS is defined based on the messages of α and β). A rule with data correlation is defined as R(α|β)/CS, where R ∈ {P: Permission; F: Prohibition} and CS is an optional part. The constraint P(α|β) or F(α|β) (where F(α|β) = P(NOT α|β)) respectively means that it is permitted or prohibited to have α true when the context β holds within the conditions of CS.

Example 2. We can create a new account on the service if we successfully logged in within at most one day ago and have not yet logged out by now. The rule with data correlation for this event can be denoted as
P(start(createAccountReq) | O^{d[1,0]D} (done(loginRes) ∧ ¬done(logoutReq))).
In case we want to indicate the messages belonging to a session by using sessionId, we can denote it as


P(start(createAccountReq) | O^{d[1,0]D} (done(loginRes) ∧ ¬done(logoutReq))) /
{{createAccountReq.sessionId, loginRes.sessionId, logoutReq.sessionId}}

B. Semantics
A model of rules corresponds to a pair r = (P_r, C_r) where
- P_r is a total function that associates every integer x with a propositional formula.
- C_r is a total function that associates every integer x with a pair (φ, d), where φ is a formula and d a positive integer.
Intuitively, ∀x, p ∈ P_r(x) means that proposition p is true at time x, while (φ, d) ∈ C_r(x) means that the context of formula φ holds (is evaluated true) at time t, where
- t ∈ [x, x + d] if we focus on future time;
- t ∈ [x − d, x] if we focus on past time.

IV. VERIFICATION

A. Correctness of a System
The following definition is a formal description of the correctness of a system. That is, a system is correct if the execution traces obtained from the IUT satisfy the properties expressed by the rules, and a system fails if a rule times out or its content is evaluated to be false.

Definition 5. Correctness of a timed trace w.r.t. a finite set of rules. Let Γ = σ_0.σ_1.σ_2... be an observable timed trace that is collected from a running system, where σ_i = (m_i, t_i), i = 0, 1, 2, ..., denotes a message and its occurrence time, and let Φ = {φ_0, φ_1, ..., φ_n} be a finite set of rules. We define that Γ conforms to Φ if and only if, for all i, there exists no φ_j such that φ_j times out at t_i or the evaluation of φ_j after updating its context is false.

B. Checking Algorithm
In this section, we give the outline of the computation mechanism used to determine whether a rule holds for some given input/output sequence of events. Our algorithm verifies message-by-message the conformity with each rule without storing the message sequence. Here, we use two global variables, namely currlist and rulelist. currlist is a list of enabled rules that have been activated, while rulelist is the list of defined rules that are used to verify the system. Before introducing the details of our algorithm, we present some functions running on the context of each rule.
- Function correlation. This function will return one of three values, namely undefined, true or false. The value undefined is returned when a message is not defined in the set of data correlations of the rule. If a message is defined in the set of data correlations of the rule, then this function will query the corresponding value and return true/false after comparing it with the values of the previous messages.
- Function contain. This function verifies whether a message is contained in the context of a rule. It returns true if a message is found in the context of a rule and its conditions are validated (if they are defined). For example, given the context of a rule msgA(id = 5) ∧ msgB, when message msgA (with its value id = 4) arrives, the function returns false because its condition (i.e., id = 4) does not match that in the rule even though the message name is found; whereas when message msgB arrives, this function returns true.
- Function update. This function updates the value of the context whenever a message arrives and is found in the context (verified by the function contain). For example, the context of a rule is loginResponse ∧ logoutRequest. When message loginResponse arrives, this context is updated as true ∧ logoutRequest.
- Function evaluate. This function evaluates whether or not the context of a rule holds (true) by returning one of three values, namely true, false or undefined. The undefined value is returned if there is at least one message name remaining in the context of the rule. For instance, the context true ∧ logoutRequest is evaluated to be undefined. During the evaluation, a message under the function NOT (which only applies to atomic actions) will be provisionally assigned as true. For example, at the time of evaluation, the expression true ∧ ¬logoutRequest will be evaluated as true.
As aforementioned, there are two types of rules, namely future time and past time rules. To make this clearer, we will analyze the checking algorithm for each type.
1) Rules with future time:
Given that each rule has two parts (i.e., the supposition and context parts), a rule will be evaluated as either true or false or undefined if its supposition has been enabled and the current message belongs to its context. At any occurrence time t of message msg, our algorithm checks the correctness of a rule in two steps.
- Step 1. Examine the list of enabled rules currlist to evaluate their contexts if the time constraints are valid. If the context of a rule is evaluated to be true/false, then it will be removed from the enabled list currlist and the corresponding verdict is returned. Otherwise (i.e., the context is undefined, meaning an incomplete context), we wait for the arrival of the next message and return true as the verdict.
- Step 2. Examine the list of rules rulelist to activate them if their supposition contains the current message msg.
Algorithm 1 shows how to check the correctness of a message with a set of future time rules, in which we assume that the rules are Permission (the Forbidden rules are the negation of the verdict of the Permission rules), and do not consider data correlation.
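To make the behaviour of contain, update and evaluate more tangible, the Java sketch below models a rule context as a pure conjunction of (possibly negated) message literals; the real tool also supports OR and conditions on message content, and a correlation function would additionally compare stored data values. All class and field names here are illustrative assumptions, not the RV4WS implementation.

import java.util.ArrayList;
import java.util.List;

public class RuleContextSketch {

    // One literal of a conjunctive context, e.g. done(loginRes) or NOT done(logoutReq).
    static class Literal {
        final String message;
        final boolean negated;
        Boolean value;                       // null = still undefined
        Literal(String message, boolean negated) { this.message = message; this.negated = negated; }
    }

    final List<Literal> context = new ArrayList<>();

    // contain: is the incoming message still mentioned by the context?
    boolean contain(String msg) {
        return context.stream().anyMatch(l -> l.value == null && l.message.equals(msg));
    }

    // update: replace the matched literal by its truth value.
    void update(String msg) {
        for (Literal l : context) {
            if (l.value == null && l.message.equals(msg)) {
                l.value = !l.negated;        // a negated literal becomes false once its message occurs
            }
        }
    }

    // evaluate: true/false when decided; null (undefined) while a positive literal is still open.
    // Negated literals that have not occurred are provisionally taken as true.
    Boolean evaluate() {
        boolean undefined = false;
        for (Literal l : context) {
            if (l.value == null) {
                if (l.negated) continue;     // provisionally true
                undefined = true;
            } else if (!l.value) {
                return false;
            }
        }
        return undefined ? null : true;
    }
}

For instance, for the context done(loginRes) ∧ ¬done(logoutReq), evaluate() returns null before loginRes arrives and true afterwards, matching the behaviour described for the evaluate function above.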


Algorithm 1: Checking algorithm for future time rules
Input: timed event (msg, t)
Output: true/false
verdict ← true
1) For each r ∈ currlist
   IF the time constraints of r at t are valid
      IF msg belongs to the context of r
         Update the context of r by msg
         IF the evaluation of the context of r is true/false
            Remove r from currlist
            verdict ← verdict ∧ (true/false)
   ELSE: verdict ← false
2) For each r ∈ rulelist
   IF msg belongs to the supposition of r
      Update the activated time of r by t
      Add r into currlist (activated)

Algorithm 2: Checking algorithm for past time rules
Input: timed event (msg, t)
Output: true/false
verdict ← true
1) For each r ∈ currlist
   IF the time constraints of r at t are valid
      IF msg belongs to the supposition of r
         Remove r from currlist
         IF the evaluation of the context of r is false/undefined
            verdict ← verdict ∧ false
      ELSE IF the context of r contains msg
         Update the context of r by msg
   ELSE: verdict ← false
2) For each r ∈ rulelist
   IF msg belongs to the context of r
      Update the activated time of r by t
      Add r into currlist (activated)

2) Rules with past time:
For a rule with past time, the context part will happen before its supposition, meaning that the context part must be evaluated to be true/false whenever its supposition handles the current message. Upon the arrival of any timed event (msg, t), our algorithm checks the correctness of a rule with past time in two steps.
- Step 1. Examine the list of enabled rules currlist to check the correctness of the current message msg. If t satisfies their time constraints and msg belongs to their supposition, then remove them from the list currlist. At the same time, if their context is evaluated to be false/undefined, then a false verdict will be assigned; otherwise, a true verdict is admitted. On the other hand, if msg does not belong to their supposition and msg is found in their context, then we update their context by msg and wait for the next message to evaluate these rules.
- Step 2. Examine the list of rules rulelist and activate them if their context contains the current message msg.
Algorithm 2 shows how to check the correctness of a message with a set of past time rules under the assumption that the rules are Permission.
By combining the above two algorithms, we give the complete checking algorithm as shown in Algorithm 3. It verifies event-by-event and returns the verdict whenever a timed event happens. The two functions verify_future() and verify_past() called by Algorithm 3 are shown in Algorithms 4 and 5.
There is an exception that a fail verdict is returned if the algorithm finds a rule that is not satisfied and not applicable to the current message. To identify which rule fails upon the arrival of a message, we propose graphic statistics to show the current test states.
Example 3. We have an execution of a timed trace with the message names and their occurrence times as: (a1, 0), (a2, 2), (a1, 3), (b2, 8), (b1, 9), (a2, 12), (b3, 15), (c1, 16), .... The security rules defined to assess the system are:
r1 = P(start(a1) | O^{d[0,10]} (done(b1) ∧ ¬done(c1))),
r2 = P(start(b2) | O^{d[+∞,0]} (done(a2) ∧ ¬done(c2))).
Table I shows the results of the algorithm execution. In the table, a false verdict is returned at message (b3, 15) due to the failure of rule r1 at time 15, for which the last enabling message is (a1, 3).

V. RV4WS TOOL
RV4WS (Runtime Verification for Web Services) is a software tool implemented to verify a web service at runtime based on a set of constraints defined by the syntax in Section III. This tool receives a sequence of messages (message content and its occurrence time) via a TCP/IP port, then verifies the correctness of this sequence. The architecture of RV4WS is shown in Figure 2.

Fig. 2. Architecture of the RV4WS tool


message   | enabled rule list                                                        | verdict | add/remove (+/-)
(a1, 0)   | {r1+ = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1))}                        | true    | +r1
(a2, 2)   | {r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1));                         | true    | +r2
          |  r2+ = P(start(b2) | O^{d[+∞,0]} true ∧ ¬done(c2))}                       |         |
(a1, 3)   | {r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1));                         | true    | +r1
          |  r2 = P(start(b2) | O^{d[+∞,0]} true ∧ ¬done(c2));                        |         |
          |  r1+ = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1))}                        |         |
(b2, 8)   | {r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1));                         | true    | -r2
          |  r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1))}                         |         |
(b1, 9)   | {r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1))}                         | true    | -r1
(a2, 12)  | {r1 = P(true | O^{d[0,10]} done(b1) ∧ ¬done(c1));                         | true    | +r2
          |  r2+ = P(start(b2) | O^{d[+∞,0]} true ∧ ¬done(c2))}                       |         |
(b3, 15)  | {r2 = P(start(b2) | O^{d[+∞,0]} true ∧ ¬done(c2))}                        | false*  | -r1
(c1, 16)  | {r2 = P(start(b2) | O^{d[+∞,0]} true ∧ ¬done(c2))}                        | true    |

TABLE I. An example of runtime verification

Algorithm 3: The detailed runtime verification algorithm

Require: currlist is the list of current rules that were enabled; rulelist is the list of rules that are defined to verify the system.
Input: message msg, occurrence time t.
Output: true/false
1  res ← true;
2  list ← ∅;  // a list
3  // step 1: check in currlist to give a verdict
4  foreach rule in currlist do
5      // if a rule is enabled many times, we just pick up the first one to consider and use the variable list to handle this problem
6      if rule.id ∉ list then
7          if rule is future time then
8              res ← res ∧ verify_future(rule, msg, t);
9          else
10             res ← res ∧ verify_past(rule, msg, t);
11         list.add(rule.id);
12 // step 2: check in rulelist to enable new rules
13 foreach rule in rulelist do
14     if msg ∈ rule.supposition() ∧ rule.condition(msg) = true then
15         if rule is future time then
16             r1 ← rule;  // create a new rule
17             r1.active_time ← t;  // set the active time
18             r1.getDataCorrelationValue(msg);
19             currlist.add(r1);  // add into the enabled list
20         // the rule is not processed in the first step (rule.id ∉ list) if it is a past time rule
21         else if rule.correlation(msg) ≠ false ∧ rule.evaluate() ≠ true ∧ rule.id ∉ list then
22             res ← false;
23     // the rule is not processed in the first step (rule.id ∉ list) if it is a past time rule
24     else if rule is past time ∧ rule.id ∉ list ∧ rule.context.contain(msg) then
25         r1 ← rule;  // create a new rule
26         r1.active_time ← t;  // set the active time
27         r1.update(msg);  // update the context
28         r1.getDataCorrelationValue(msg);
29         currlist.add(r1);  // add into the list of enabled rules
30 return res;
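The two-step structure of Algorithm 3 can be rendered compactly in Java as follows. This is only a sketch of the control flow under simplifying assumptions (the past-time failure case of lines 21-22 is omitted); the Rule interface and its methods are stand-ins for the corresponding elements of the pseudo code, not the RV4WS source.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CheckerSketch {

    interface Rule {
        int id();
        boolean isFutureTime();
        boolean suppositionContains(String msg);
        boolean conditionHolds(String msg);
        boolean contextContains(String msg);
        Rule activate(long time, String msg);        // copy with activation time and correlation values
        boolean verifyFuture(String msg, long t);    // Algorithm 4
        boolean verifyPast(String msg, long t);      // Algorithm 5
    }

    final List<Rule> rulelist = new ArrayList<>();   // defined rules
    final List<Rule> currlist = new ArrayList<>();   // enabled rules

    // One call per timed event (msg, t); returns the verdict for this event.
    boolean check(String msg, long t) {
        boolean res = true;
        Set<Integer> seen = new HashSet<>();
        // Step 1: evaluate the already enabled rules (first enabled instance of each rule id only).
        for (Rule r : new ArrayList<>(currlist)) {
            if (seen.add(r.id())) {
                res &= r.isFutureTime() ? r.verifyFuture(msg, t) : r.verifyPast(msg, t);
            }
        }
        // Step 2: enable new rules whose supposition (future) or context (past) matches msg.
        for (Rule r : rulelist) {
            if (r.suppositionContains(msg) && r.conditionHolds(msg) && r.isFutureTime()) {
                currlist.add(r.activate(t, msg));
            } else if (!r.isFutureTime() && !seen.contains(r.id()) && r.contextContains(msg)) {
                currlist.add(r.activate(t, msg));
            }
        }
        return res;
    }
}

Fed the timed events of Example 3 one by one, a driver of this shape would be expected to report true verdicts until the event (b3, 15), where rule r1 times out, as reflected in Table I.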


Algorithm 4: verify_future(rule, msg, t)

Require: currlist is a global variable
Input: rule: a rule, msg: a message, t: occurrence time
Output: true/false
1  result ← true;
2  // the time condition is FALSE and the type of rule is Permission
3  if verifyTime(t, rule.active_time) = false ∧ rule.type = 'P' then
4      result ← false;
5      currlist.remove(rule);
6  else if rule.context.contain(msg) ∧ rule.correlation(msg) ≠ false then
7      rule.update(msg);  // update the context
8      if rule.evaluate() = true then
9          currlist.remove(rule);
10         // the time condition is TRUE and the type of rule is Prohibition
11         if rule.type = 'F' ∧ verifyTime(t, rule.active_time) = true then
12             result ← false;
13     else if rule.evaluate() = false then
14         currlist.remove(rule);
15         // the type of rule is Permission
16         if rule.type = 'P' then
17             result ← false;
18 return result;

Algorithm 5: verify_past(rule, msg, t)

Require: currlist is a global variable
Input: rule: a rule, msg: a message, t: occurrence time
Output: true/false
1  result ← true;
2  if msg ∈ rule.supposition() ∧ rule.condition(msg) = true ∧ rule.correlation(msg) ≠ false then
3      currlist.remove(rule);
4      if rule.evaluate() = true then
5          // the time condition is TRUE and the type of rule is Prohibition
6          if rule.type = 'F' ∧ verifyTime(t, rule.active_time) = true then
7              result ← false;
8      else
9          // the type of rule is Permission
10         if rule.type = 'P' then
11             result ← false;
12 else
13     if verifyTime(t, rule.active_time) = false then
14         currlist.remove(rule);
15     else if rule.context.contain(msg) ∧ rule.correlation(msg) ≠ false then
16         rule.update(msg);
17 return result;
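For completeness, the sketch below re-expresses Algorithm 4 in Java; Algorithm 5 would be translated in the same way. It is only an illustration of the pseudo code above, and the EnabledRule interface (with assumed method names and a char rule type) is not the tool's actual API.

public class VerifyFutureSketch {

    // Simplified view of an enabled rule instance (illustrative only).
    interface EnabledRule {
        char type();                      // 'P' = permission, 'F' = prohibition
        boolean timeOk(long t);           // verifyTime(t, active_time)
        boolean contextContains(String msg);
        boolean correlationOk(String msg);
        void update(String msg);
        Boolean evaluate();               // true / false / null (undefined)
    }

    static boolean verifyFuture(EnabledRule rule, String msg, long t,
                                java.util.List<EnabledRule> currlist) {
        boolean result = true;
        if (!rule.timeOk(t) && rule.type() == 'P') {          // deadline missed for a permission
            result = false;
            currlist.remove(rule);
        } else if (rule.contextContains(msg) && rule.correlationOk(msg)) {
            rule.update(msg);
            Boolean v = rule.evaluate();
            if (Boolean.TRUE.equals(v)) {
                currlist.remove(rule);
                if (rule.type() == 'F' && rule.timeOk(t)) {   // prohibited behaviour observed in time
                    result = false;
                }
            } else if (Boolean.FALSE.equals(v)) {
                currlist.remove(rule);
                if (rule.type() == 'P') {
                    result = false;
                }
            }
        }
        return result;
    }
}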


Fig. 3. ParseData Interface of RV4WS
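Figure 3 itself is not reproduced in this text-only rendering. Based on the description that follows (an adapter with the methods getMessageName() and queryData(), implemented by ParseSoapImpl for web services), the interface plausibly has the shape of the Java sketch below; the exact signatures are assumptions made for illustration only.

public interface IParseData {

    /* Extracts the message name from the raw content of an incoming message. */
    String getMessageName(String messageContent);

    /* Queries the value of one field of the message content,
       e.g. via an XPath-like query expression. */
    String queryData(String messageContent, String fieldQuery);
}

/* A SOAP-oriented implementation stub, standing in for the ParseSoapImpl class
   mentioned in the text; the real parsing logic is omitted. */
class ParseSoapImpl implements IParseData {

    @Override
    public String getMessageName(String messageContent) {
        // e.g. return the local name of the first element inside the SOAP body
        throw new UnsupportedOperationException("illustrative stub");
    }

    @Override
    public String queryData(String messageContent, String fieldQuery) {
        // e.g. evaluate fieldQuery against the SOAP payload
        throw new UnsupportedOperationException("illustrative stub");
    }
}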

The checking engine in the architecture implements the runtime verification algorithm 3. It allows us to verify each incoming message without any constraint of order dependencies, and is applicable to both on-line and off-line testing. Moreover, it verifies the validity of the current message without using any storage memory. In order to use this engine for other systems, we define an interface IParseData, shown in Figure 3, as an adapter to parse the incoming data for RV4WS if the data structure of the input/output messages from another system is different from ours. The methods in IParseData are for gathering information from incoming messages. The method getMessageName() returns the message name from its content, while the method queryData() allows us to query a data value from a field of the message content. This interface is implemented for each application case. For example, its implementation is the class ParseSoapImpl for a web service application. This engine has been designed as a Java library and is controlled by a component known as Controller, which receives a data stream coming from a TCP/IP port.
The input format for this tool is an XML file as defined in Figure 4. A rule with a true or false verdict respectively represents a permission or a prohibition. The context of a rule is expressed as an expression with the three operators AND, OR and NOT. Each data correlation is defined as a property with some query expressions from different SOAP messages. For web service applications, we have developed a Graphic User Interface (GUI) that allows us to easily define a set of rules from WSDL files.

Fig. 4. Rule format example

The checking algorithm returns a fail verdict if a rule is found not satisfied, meaning that this rule is not applicable to the current message. To identify which rule fails at the arrival of a message, we have developed a Graphic User Interface (GUI) for visualizing some statistical properties that are calculated at any moment of the testing process. Whenever a rule is activated, which means that its conditions have been satisfied, a statistical property such as a type counter will be used to compute the percentage of unsatisfying time when applying the rule to the input data stream. If the rule has been satisfied, we need to know the time duration from the activating moment to the moment its context holds. For each rule, we have three statistical properties about time, namely time-min, time-max and time-average.
Now we need to know the values of these statistical properties, such as the failure percentage in proportion to the duration time or to other properties, for a rule executing, and also to visualize the relationships between them. If we had used a histogram view and applied it to each, we would not have been able to get this information because of the different scales of these properties. We built a visualized interface which is based on the idea of the parallel coordinates scheme introduced by Inselberg [19].
In information visualization, a parallel coordinates view is used to show the relationships between items in a multi-dimensional data set. The axes in this view are parallel to each other, and a point in an n-dimensional space is represented as a polyline with vertices on these axes. Considering that the list of statistical properties of our testing process is a multi-dimensional data set, we have applied this visualization to the RV4WS tool and made it possible to explore the results of our checking algorithms.
As mentioned above, we have implemented the checking algorithms inside the RV4WS tool, which enables a user-tester to verify the conditions defined in rules. When the user-tester finds that rule properties change over time, he/she may need a complete view of these traces of the testing process. There are parallel coordinates views corresponding to the rules. In Figure 5, each scheme of parallel coordinates represents a time-log of statistical values as polylines crossing the property axes. Within each view, there is a single polyline per time instance. The lines of the current time are always highlighted. This view enables the tester to quickly tell from the GUI whether or not these changes of executing rule properties are interesting. This visualization does not have to be refreshed in real time; rather, it can be refreshed after a duration.

VI. A CASE STUDY
In this section, we present a real-life case study known as Product Retriever [23] from the WebMov project, and tell how to apply our RV4WS tool to test Product Retriever. This case study is a BPEL process that allows users to automate part of the purchasing process. It enables users to retrieve one searched product sold by an authorized provider. The search is limited by specifying a budget range and one or more keywords characterizing the product. The product search is done through the operation getProduct and the parameter RequestProductType, which is composed of information about the user (first-name, last-name and department) and the searched product (keyword, max price, category).


Fig. 5. The main GUI and checking analysis of RV4WS tool

The process contains four partner services, namely AmazonFR, AmazonUK, CurrencyExchange and PurchaseService. They are developed by Montimage (http://www.montimage.com/) and available online (http://80.14.167.59:11404/servicename). The overall behavior of the process is illustrated in Figure 6 and described as follows.
1) It receives a message from the client with the product and the keywords of the characteristics of the product.
2) It contacts the PurchaseService partner to obtain the list of authorized providers for that product. In case there is no authorized provider, an announcement is sent to the client by a fault message response.
3) Depending on the authorized provider result, the process contacts either the AmazonFR or the AmazonUK service to search for a product that matches the price limit in Euro and the keywords.
4) It sends back to the client the product information and the name of the provider where the product was found, together with the link from which it can be ordered. If a matching product is not found, a response with an unsatisfied product will be sent back to the client.
5) After receiving the product information, the client can send an authorization request to confirm the purchase of the product within a certain duration of time (e.g., one minute).
The Product Retriever service is built with Netbeans 6.5.1 and deployed by a Sun BPEL engine within a Glassfish 2.1 web server.

Fig. 6. ProductRetriever - BPMN specification

A. Test Product Retriever by the RV4WS tool
In this section, we present some preliminary results from our first experiment on the case study of Product Retriever using the RV4WS tool.


SoapUI [24] is a well-known test tool for web services. We use it in our experiment as a client of the Product Retriever service, sending requests to activate the web service (i.e., the BPEL process). To collect the communication messages between the Product Retriever service and its partners (including SoapUI), we have developed a proxy that allows us to forward a message to a specified destination. This allows us to receive from and forward to several sources and destinations. Each connection is handled on a different port. Afterwards, this message and its occurrence time are also sent to our RV4WS tool to check its correctness. SoapUI and the Product Retriever service were configured to make connections through the proxy. The connection information (service name) is also sent to RV4WS to help this tool easily identify which message belongs to which service. Figure 7 shows our testbed architecture.

Fig. 7. Testbed architecture

1) Rule definition: We can define many test purposes to verify the interaction order with the partner services. Here we introduce three test purposes:
- During the execution of the service, if the client receives a ProductFault message, then the Purchase service must have already returned a ProviderFault message. The time constraint for this test purpose is less important, so we can define the maximal time interval between the two messages as 10 seconds.
P(start(ProductFault) | O^{d[10,0]s} done(ProviderFault))
- If the Purchase service indicates the provider service AmazonUK, then the orchestration must contact the CurrencyExchange service within 10 seconds.
P(start(getProviderResponse[provider = AmazonUK]) | O^{d[0,10]s} done(getCurrencyRateRequest))
- When the client sends an authorization request message to confirm the purchase of a product, then it must have received a product response message with the field EmptyResponseProduct being null within the last minute. In this rule, the data correlation is made by userId.
P(start(getAuthorizationRequest) | O^{d[1,0]m} done(getProductResponse[EmptyResponseProduct = null])) / (getAuthorizationRequest.userid, getProductResponse.userid)
2) Checking results: Figure 8 illustrates the checking analysis of the Product Retriever, which indicates:
- The fault messages that are defined in rule 1 do not occur (see the percentage in the fail column of rule 1).
- The message getProviderResponse with provider = AmazonUK appeared three times (see the value in the enabled count column of rule 2); however, there were two occasions where the tool did not find the message getCurrencyRateRequest within 10 seconds from the occurrence time of the message getProviderResponse. In Figure 9, we found that the interval times between them are 26 seconds for the first fail case and 42 seconds for the second fail case. The tool therefore produces the fail verdicts (the fail column of rule 2).
- The message getAuthorisationRequest appeared two times (see the value in the enabled count column of rule 3). Before that, the message getProductResponse also appeared with the field EmptyResponseProduct being empty, and the interval time between them was less than one minute.
In Figure 9, a false verdict is returned when the itemSearchResponse arrives because, at the occurrence time of itemSearchResponse, the time constraint of rule 2 (i.e., 10 seconds) is not satisfied.

VII. CONCLUSIONS
This paper presents a passive test method for systems, in particular for web services, with (1) the definition of a language including logic expressions for constraints and (2) a verification method and a tool implementing the verification algorithm. This tool has been integrated in the WebMov tool chains. To verify the practicability of the proposed method on real systems, a real case study, which is a web service composition known as Product Retriever, has been extensively studied.
Extensions planned for this research include (1) a system for calculating the test coverage (corresponding to a real need of the implementors of the web services), and (2) an extension to test more complex distributed systems such as cloud computing architectures by integrating a set of distributed observers with recovery of all the traces that need to be synchronized.

ACKNOWLEDGMENT
We would like to thank Ms. Nguyen Thi Kim Dung, a master student from PUF (Pole Universitaire Francais) in Ho Chi Minh City, for helping us develop the RV4WS tool during her internship in LaBRI. We also thank Montimage for their case study Product Retriever.


Fig. 8. Checking analysis of Product Retriever

Fig. 9. A part of collected trace of Product Retriever


REFERENCES
[1] C. Andres, M. G. Merayo, and M. Nunez, "Formal correctness of a passive testing approach for timed systems," IEEE International Conference on Software Testing, Verification, and Validation Workshops, pp. 67-76, Apr 01-04, 2009, Denver, Colorado, USA.
[2] C. Andres, M. G. Merayo, and M. Nunez, "Passive Testing of Stochastic Timed Systems," International Conference on Software Testing Verification and Validation, pp. 71-80, Apr 01-04, 2009, Denver, Colorado, USA.
[3] C. Andres, M. G. Merayo, and M. Nunez, "Passive Testing of Timed Systems," International Symposium on Automated Technology for Verification and Analysis, pp. 418-427, vol. 5311, LNCS, 2008.
[4] L. Baresi, S. Guinea, M. Pistore, and M. Trainotti, "Dynamo + Astro: An Integrated Approach for BPEL Monitoring," 2009 IEEE International Conference on Web Services, pp. 230-237, July 6-10, 2009, Los Angeles, CA, USA.
[5] L. Baresi, S. Guinea, R. Kazhamiakin, and M. Pistore, "An Integrated Approach for the Run-Time Monitoring of BPEL Orchestrations," The 1st European Conference on Towards a Service-Based Internet, pp. 1-12, 2008, Madrid, Spain.
[6] L. Baresi and S. Guinea, "Towards Dynamic Monitoring of WS-BPEL Processes," The Third International Conference on Service-Oriented Computing, pp. 269-282, Dec 12-15, 2005, Amsterdam, The Netherlands.
[7] A. Benharref, R. Dssouli, M. A. Serhani, A. En-Nouaary, and R. Glitho, "New Approach for EFSM-Based Passive Testing of Web Services," Testing of Software and Communicating Systems, pp. 13-27, vol. 4581, 2007.
[8] H. Barringer, A. Goldberg, K. Havelund, and K. Sen, "Rule-Based Runtime Verification," 5th International Conference on Verification, Model Checking, and Abstract Interpretation, Jan 11-13, 2004, Venice, Italy.
[9] E. Bayse, A. Cavalli, M. Nunez, and F. Zaidi, "A passive testing approach based on invariants: application to the WAP," Computer Networks, 48:247-266, 2005.
[10] T.-D. Cao, T.-T. Phan-Quang, P. Felix, and R. Castanet, "Automated Runtime Verification for Web Services," IEEE International Conference on Web Services, pp. 76-82, July 5-10, 2010, Miami, FL, USA.
[11] A. Cavalli, E. Montes De Oca, W. Mallouli, and M. Lallali, "Two Complementary Tools for the Formal Testing of Distributed Systems with Time Constraints," 12th IEEE International Symposium on Distributed Simulation and Real Time Applications, Canada, Oct 27-29, 2008.
[12] A. Cavalli, C. Gervy, and S. Prokopenko, "New approaches for passive testing using an extended finite state machine specification," Information and Software Technology, 45(12):837-852, 2003.
[13] A. Cavalli, A. Benameur, W. Mallouli, and K. Li, "A Passive Testing Approach for Security Checking and its Practical Usage for Web Services Monitoring," NOTERE 2009, Montreal, Canada, 2009.
[14] F. Cuppens, N. Cuppens-Boulahia, and T. Sans, "Nomad: a security model with non atomic actions and deadlines," 18th IEEE Workshop on Computer Security Foundations, pp. 186-196, June 20-22, 2005, Aix-en-Provence, France.
[15] S. Halle, R. Villemaire, and O. Cherkaoui, "Specifying and Validating Data-Aware Temporal Web Service Properties," IEEE Transactions on Software Engineering, 35(5):669-683, 2009.
[16] W. Mallouli, F. Bessayah, A. Cavalli, and A. Benameur, "Security Rules Specification and Analysis Based on Passive Testing," IEEE Global Telecommunications Conference, 2008, pp. 1-6, Nov 30-Dec 4, 2008, New Orleans, LA, USA.
[17] D. Dranidis, E. Ramollari, and D. Kourtesis, "Run-time Verification of Behavioural Conformance for Conversational Web Services," 2009 Seventh IEEE European Conference on Web Services, pp. 139-147, Nov 9-11, 2009, Eindhoven, The Netherlands.
[18] A. Goldberg and K. Havelund, "Automated Runtime Verification with Eagle," Verification and Validation of Enterprise Information Systems, May 24, 2005, Miami, USA.
[19] A. Inselberg, "The plane with parallel coordinates," The Visual Computer, 1(2):69-91, 1985.
[20] Z. Li, Y. Jin, and J. Han, "A Runtime Monitoring and Validation Framework for Web Service Interactions," Proceedings of the Australian Software Engineering Conference, pp. 70-79, Apr 18-21, 2006, Sydney, Australia.
[21] Z. Li, J. Han, and Y. Jin, "Pattern-Based Specification and Validation of Web Services Interaction Properties," In Proceedings of the 3rd International Conference on Service Oriented Computing (ICSOC'05), pp. 73-86, Dec 12-15, 2005, Amsterdam, The Netherlands.
[22] M. Tabourier and A. Cavalli, "Passive testing and application to the GSM-MAP protocol," Information and Software Technology, 41:813-821, 1999.
[23] W. P. Consortium, "D5.1 WebMov case studies: definition of functional requirements and test purposes," WebMov, Tech. Rep. WEBMOV-FC-D5.1/T5.1, 2009.
[24] Eviware, http://www.eviware.com/.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1351

Algorithm of diffraction for standing tree based on the uniform geometrical theory of diffraction
Yun-Jie Xu*
School of Technology, Zhejiang Agricultural & Forestry University, Linan, China
Email: xyj9000@163.com

Wen-Bin Li
School of Technology, Beijing Forestry University, Beijing, China
Email: leewb@bjfu.edu.cn

Shu-Dong Xiu
School of Technology, Zhejiang Agricultural & Forestry University, Linan, China
Email: sdxiu@zjfc.edu.cn

Abstract - The diffraction fields and the shadow region of standing tree diffraction are solved by using polygons as approximate substitutes for the circle. The algorithm model of ray path tracing is computed by tracing the ray path on the standing tree. Considering electromagnetic wave propagation, a physical model of diffraction for a forest is constructed. The mathematical model of diffraction is presented using the uniform geometrical theory of diffraction (UTD). Moreover, the expression for diffraction loss is derived. The results were then applied to the fir. Simulation and analysis show the validity of the proposed model.

Index Terms - UTD, standing tree, algorithm of diffraction

I. INTRODUCTION

Continuous data acquisition on multiple standing trees, which are sampled from a forest and are less than 50 m away from one another, is achieved. Thus, the information transfer between sensors is fully implemented. The sensor has a data acquisition multi-interface connecting sensors for temperature, illumination, and soil moisture, among others, which are all placed at breast height, thereby achieving wireless communication through an organized meshwork between sensors. The system can be practically used in the dynamic monitoring of growing forest stock and in obtaining the biomass ranges of the trees growing in different directions. The information gathered using the proposed system is useful for policymaking on the prevention or control of forest fires [1].

Results show that signal frequency, stem form, propagation direction, etc. influence the diffraction fields and the blind region of diffraction. In this paper, an algorithm for the diffraction fields and the blind region of diffraction in a plantation is presented, motivated by the application of electromagnetic wave propagation and by experimentation.

II. PHYSICAL MODEL OF DIFFRACTION FOR STANDING TREE

Ray theories, like the Geometrical Theory of Diffraction (GTD) and its uniform version (UTD) [2], are a very useful approach to characterize the scattering from objects and to estimate the electromagnetic field in arbitrarily complex environments. In such theories, the electromagnetic field is described in terms of rays arising from a source, propagating through the scenario and interacting with it. These rays can be reflected or diffracted by the objects around the source. When the environment is complex, considering only the (singly) wedge-diffracted rays is usually inadequate to reach the desired accuracy, especially when observing in low-field shadow regions. Therefore it is necessary to introduce higher-order diffraction mechanisms that consist of vertex and multiple edge diffracted rays.

Stem form is diverse, but as a whole it consists of the basal area at breast height and the basal area at the end. The wireless sensor placed at a breast height of 1.3 m can be further lowered to the basal

Manuscript received September 11, 2011.
* Corresponding author; E-mail: xyj9000@163.com

area in the current paper, so only the effects of the basal area of breast height are considered. Circular area expressions are used for the measurements of the basal area of breast height and of standing wood volume, with an average error of 3%.

Fig. 1 shows the cylindrical surface r(u, v); the source point p_t(u1, v1) lies on the cylindrical surface and the field point is p_r(u2, v2). The interval [v1, v2] is divided into N segments along u = u1, and the N-1 section points determine N-1 u coordinate lines; similarly, the interval [u1, u2] is divided into M segments along v = v1, and the M-1 section points determine M-1 v coordinate lines. These u lines and v lines intersect, and the intersection points p_ij characterize the whole surface discretely. The crawling process of the ray lines is divided into M segments using the v coordinates of the M-1 items; that is, the ray tracing solution process is divided into M phases, with phase variable K = 1, 2, 3, ..., M [4].

[Figure 1. Ray path tracing of standing tree]

Phase 1 of the state is S_1 = {p_t}. Phase K of the state set of value selection is the intersection point set S_K = { p_ij[u_i, v_j] } of the coordinate net, where i = K (K > 1) and j = 1, 2, 3, ..., N. Phase K of the state set of decision-making is the intersection point set D_K = { p_ij[u_i, v_j] }.

Based on the above section and the variable definitions, the iteration relation of the ray tracing is

    f_{n+1}(s_{n+1}) = 0
    f_k(s_k) = min_{x_k in D_k} { d(s_k, x_k) + f_{k+1}(s_{k+1}) }    (1)

where k = n, n-1, ..., 2, 1; d(s_k, x_k) = n_k |s_k - x_k|, n_k is the refractive index, and |s_k - x_k| is the distance between the two points.

The tracing calculation can be divided into four steps:
1) Discretize the surface and determine the coordinates p_ij;
2) Using the substitution k = n+1, set f_k(s_k) = 0;
3) Using the substitution k = k-1, Eq. (1) yields f_k(s_k) and x_k, with s_k in {p_kj, j = 1, 2, ..., m};
4) Repeat step 3) until k = 1; the calculated f_1(s_1) is the arc length of the traced ray, and the coordinate points along the ray can also be recovered by the reverse pass.

The parametric equation of the cylinder is

    r = (a cos v, a sin v, u)    (2)

According to the nature of the cylindrical developable surface, the ray path from the source point to the field point is a cylindrical spiral, and the arc length of the ray is calculated as

    s = sqrt( (u2 - u1)^2 + (v2 - v1)^2 )    (3)
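The backward recursion of Eq. (1) can be illustrated with a small dynamic-programming sketch. This is not code from the paper: the grid sizes, the single refractive index n_k and the plain Euclidean metric on the developed (u, v) plane are simplifying assumptions made only for the illustration.

#include <math.h>

#define M 40              /* number of phases (v subdivisions), assumed value */
#define N 40              /* candidate grid points per phase, assumed value   */

/* d(s_k, x_k) = n_k * |s_k - x_k| measured on the developed (u, v) plane */
static double dist(double u1, double v1, double u2, double v2, double nk)
{
    return nk * sqrt((u2 - u1) * (u2 - u1) + (v2 - v1) * (v2 - v1));
}

/* Backward recursion of Eq. (1): f_k(s_k) = min over x_k of d(s_k, x_k) + f_{k+1}(x_k).
   u[k][j], v[k][j] hold the grid points p_ij of phase k; the returned f[0][0]
   approximates the arc length of the traced ray from p_t to p_r.                */
double trace_ray(double u[M + 1][N], double v[M + 1][N], double nk)
{
    double f[M + 1][N];
    int k, i, j;

    for (j = 0; j < N; j++) f[M][j] = 0.0;        /* f_{n+1}(s_{n+1}) = 0 */

    for (k = M - 1; k >= 0; k--) {                /* k = n, n-1, ..., 1   */
        for (i = 0; i < N; i++) {
            double best = 1e300;
            for (j = 0; j < N; j++) {             /* minimise over D_k    */
                double c = dist(u[k][i], v[k][i], u[k + 1][j], v[k + 1][j], nk)
                           + f[k + 1][j];
                if (c < best) best = c;
            }
            f[k][i] = best;
        }
    }
    return f[0][0];
}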

According to S, N polygon edges are obtained. The model is shown in Fig. 2, where (x0, y0, z0) is the source point, (x', y', z') is the secondary source point, and (x1, y1, z1) and (xn, yn, zn) are receiver points; the number of polygon borders n is decided by the tree breed and diameter, and the precision of the model increases with increasing n. The radial can be divided into two components: one part of the incident energy propagates straight according to the laws of geometrical optics, and the other part propagates along the edge of the standing tree, so that the shadow and transition regions are formed when the edge radials continually give off diffracted radials along the tangent. From the uniform geometrical theory of diffraction (UTD) introduced by P. H. Pathak and R. G. Kouyoumjian, the field in the transition region is uniform. Because of the way the sensors are deployed, we only consider the illuminated region and the shadow in this paper.

[Figure 2. The simplified polygons model of standing tree]

III. THE MATHEMATICAL MODEL OF DIFFRACTION FOR STANDING TREE

The problem of standing tree diffraction, employing polygons to approximately take the place of the circle, can be solved by the theory of diffraction through a wedge. To describe the diffraction effects caused by the edges of surfaces with a relatively simple mathematical method, we employ the Kirchhoff-Huygens approximation [8]. Huygens recognized that the fields reaching any surface between a source and a receiver can be thought of as producing secondary point sources on the surface, which in turn generate the received fields [9]. According to the Kirchhoff approximation, the field at the receiver point (x1, y1, z1) can be written as an integral for the complex amplitude over the plane x = 0, which is given by

    E(x1, y1, z1) ≈ ∫∫ (cos θ + cos θ') (jk e^{-jk rR} / (4π rR)) E_inc(0, y', z') dy' dz'    (4)

where rR is the distance from the secondary source point in the plane x = 0 to the receiver point and E_inc is the amplitude of the secondary source.

The problem of diffraction of a plane wave by an edge of a wedge serves as a prototype for the diffraction of electromagnetic waves by a standing tree. As shown in Fig. 2, since rs >> rR, the distance from the source to the y-axis is much larger than that of the receiving point to the y-axis; thus the wave generated by the source is approximately considered as a plane wave entering from the left onto the fringe with x = 0 of the standing tree. Considering the uneven surface of trees, the fringe of the standing tree is simplified as an absorbing screen. To discuss the diffraction through the effective angle, we consider the Z component of the field E and the polarization of the plane wave. If the plane wave propagates along the x-axis and rs >> rR, then the incident angle θ' = 0 and the model can be further simplified as shown in Fig. 3, where φ is the diffraction angle and the wedge angle is α = (2 - n)π (1 < n < 2). The angle location below the x-axis is the shadow boundary, and in the shadow region below the boundary only the diffraction field exists.

[Figure 3. The simplified model of wedge diffraction]

The Z component of the appropriate incident field has spatial dependence given by

    E_inc ≈ A0 e^{-jkx}    (5)

Without loss of generality, the receiver points are assumed to lie in the plane z = 0 because of the translation symmetry along Z (the trunk height direction). With the foregoing assumptions, the field diffracted through the edge of the standing tree is given by

    E(x1, y1, 0) ≈ A0 (jk/(4π)) ∫_0^{+∞} ∫_{-∞}^{+∞} (1 + cos φ) (e^{-jk rR} / rR) dz' dy'    (6)

where the distance from the integration point to the receiver point is

    rR = sqrt( x1^2 + (y1 - y')^2 + (z')^2 )    (7)

To carry out the z integration in (6), note that the primary contribution to the integral comes from a region of z given by the Fresnel zone, which is small compared with x, i.e. x >> λ. The center of this region is the stationary-phase point where the derivative of the exponent k rR with respect to z vanishes. rR in the denominator of (7) and cos φ hardly vary there and can be treated as constants. Eq. (7) can be expanded to second order as rR ≈ R + (z')^2 / (2R), with R = sqrt( x1^2 + (y1 - y')^2 ).

Substituting rR into (6), the integration over z in (6) reduces to

    E(x1, y1, 0) ≈ A0 (jk/(4π)) ∫_0^{+∞} (1 + cos φ) (e^{-jkR} / R) ∫_{-∞}^{+∞} exp( -jk (z')^2 / (2R) ) dz' dy'    (8)

By using the substitution u = z' e^{jπ/4} sqrt( k / (2πR) ) to carry out the integral transform, equation (8) changes into

    E(x1, y1, 0) ≈ A0 (e^{-jπ/4} / 2) sqrt( k / (2π) ) ∫_0^{+∞} (1 + cos φ) (e^{-jkR} / sqrt(R)) dy'    (9)

Similarly, the expression R = sqrt( x1^2 + (y1 - y')^2 ) can be expanded to second order in ξ = y' - y1 as R ≈ ρ + ξ^2 / (2ρ), with ρ = sqrt( x1^2 + y1^2 ). Here φ1 is the angle between the X axis and the line from the edge to the receiver point, thus sin φ1 = y1/ρ.

By using the substitution v = ξ e^{jπ/4} sqrt( k/(πρ) ) to carry out the integral transform, equation (9) turns into

    E(x1, y1, 0) = A0 e^{-jkρ} e^{-jπ/4} (1 + cos φ1) / ( 2 sqrt(2πk) sin φ1 )    (10)

When carrying out the integration in (10), k is given a vanishingly small negative imaginary part, as appropriate for atmospheric absorption, to ensure convergence at the lower limit, but after the integration k can be taken to be real. Let the diffraction coefficient be D(φ1) = e^{-jπ/4} (1 + cos φ1) / ( 2 sin φ1 sqrt(2πk) ); substituting Eq. (5) into Eq. (10) yields

    E(x1, y1, 0) = A0 e^{-jkρ} D(φ1) = E_inc(0, y, z) e^{-jkρ} D(φ1)    (11)

In Fig. 3, (x1, y1, z1) is now the secondary source point and (x2, y2, z2) is the receiver point. The receiver field is given by

    E(x2, y2, z2) ≈ ∫∫ (cos φ2 + cos θ) (jk e^{-jk rR} / (4π rR)) E_inc(x1, y1, 0) dy1 dz1    (12)

The expression rR = sqrt( x2^2 + (y2 - y1)^2 + (z1)^2 ) can be expanded to second order as rR ≈ R + (z1)^2 / (2R), with R = sqrt( x2^2 + (y2 - y1)^2 ), and the incident angle is θ = 0. Substituting rR into (12) and integrating over z1, (12) reduces to

    E(x2, y2, 0) ≈ E_inc(x1, y1, 0) (jk/(4π)) ∫_0^{+∞} (1 + cos φ2) (e^{-jkR} / R) ∫_{-∞}^{+∞} exp( -jk (z1)^2 / (2R) ) dz1 dy1    (13)

By using the substitution u = z1 e^{jπ/4} sqrt( k / (2πR) ) to carry out the integral transform, equation (13) changes into

    E(x2, y2, 0) ≈ E_inc(x1, y1, 0) (e^{-jπ/4} / 2) sqrt( k / (2π) ) ∫_0^{+∞} (1 + cos φ2) (e^{-jkR} / sqrt(R)) dy1    (14)

Similarly, the expression R = sqrt( x2^2 + (y2 - y1)^2 ) can be expanded to second order with ρ = sqrt( x2^2 + y2^2 ), so that R ≈ ρ - y1 y2 / ρ. Here φ2 is the angle between the X axis and the line from the edge to the receiver point, and thus sin φ2 = y2/ρ, where φ2 = π(1 - 2/n), n = 2m with the integer m >= 3.

By using the expression v = y1 e^{jπ/4} sqrt( k/(πρ) ) to carry out the integral transform, equation (14) turns into

    E(x2, y2, 0) = E_inc(x1, y1, 0) e^{-jkρ} e^{-jπ/4} (1 + cos φ2) / ( 2 sqrt(2πk) sin φ2 ) = E_inc(x1, y1, 0) D(φ2) e^{-jkρ}    (15)

where the diffraction coefficient is D(φ2) = e^{-jπ/4} (1 + cos φ2) / ( 2 sin φ2 sqrt(2πk) ).

Accounting for the influence of the diffusion factor A(s) and according to UTD, the approach described above can also be used to derive the field at the receiver point (xn, yn, zn), i.e.

    E_n(s) = E_{n-1}(Q) D(φ_n) A(s) exp(-jks)    (16)

where E_n(s) represents the diffraction field at the location a distance s from the diffraction point Q, E_{n-1}(Q) represents the incident end-field of the radial at the diffraction point Q, D(φ_n) is the diffraction coefficient, the diffusion factor is A(s) = 1/sqrt(s), exp(-jks) is the phase delay factor, and the wave number is k = 2π/λ.

When the diffraction field for an edge is calculated, one first derives the diffraction coefficient D(φ_n) at the diffraction point Q and the incident end-field E(Q), then obtains the first diffraction field at the diffraction point Q, and finally calculates the diffraction field at the location a distance s from the diffraction point by employing the diffusion factor A(s) and the phase delay factor exp(-jks). From Eq. (16) we can see that the most important task is to derive the diffraction coefficient for calculating the diffraction field once the trajectory of the diffraction radial is confirmed.

Diffraction occurs when the radio path between the transmitter and the receiver is obstructed by a surface that has edges. In a forest, the primary diffracting obstacles which perturb the propagating fields are trees. Diffraction formulas are well established for perfectly conducting infinite wedges [14, 15], for absorbing wedges, and for impedance-surface wedges [16]. The perfectly conducting diffraction coefficients are accurate when dealing with diffraction phenomena arising from metallic objects. However, many applications, such as in a forest, involve large dielectric structures with losses. In this case, the assumption of perfectly conducting boundary conditions results in a lack of accuracy in predicting the actual electromagnetic field. On the other hand, the impedance-

surface diffraction formulas are rather cumbersome to use for propagation prediction in a forest. Thus, the difficulty of using the rigorous solutions for propagation prediction forces simplifications to be made. Some existing diffraction coefficients modify the perfectly conducting UTD diffraction coefficient in order to make it applicable to dielectric wedges with losses. For a normally incident plane wave, there is a general form of the perfectly conducting UTD-based diffraction coefficient that includes the existing solutions as special cases. The general form can be expressed as

    D(φ_n) = ( e^{-jπ/4} / ( 2n sqrt(2πk) sin β0 ) ) { cot( (π + (φ - φ')) / (2n) ) F( kL a+(φ - φ') )
             + cot( (π - (φ - φ')) / (2n) ) F( kL a-(φ - φ') )
             + cot( (π + (φ + φ')) / (2n) ) F( kL a+(φ + φ') )
             + cot( (π - (φ + φ')) / (2n) ) F( kL a-(φ + φ') ) }    (17)

where F(x) is the transition function introduced to remove the non-uniformity of Keller's solution, and is defined as

    F(x) = 2j sqrt(x) exp(jx) ∫_{sqrt(x)}^{∞} exp(-jτ^2) dτ    (18)

and the distance parameter L is given by

    L = s sin^2 β0    (19)

Moreover, the functions

    a±(β) = 2 cos^2( (2nπN± - β) / 2 )    (20)

where N± are the integers that most nearly satisfy the equations 2nπN+ - β = π and 2nπN- - β = -π. In (17), at least two of the four cot functions are divergent at the shadow boundary and the reflection boundary; meanwhile the corresponding transition functions go to zero, and thus the singularity is removed.

Next we calculate the area of the blind region. The blind region can be calculated from GTD and reference [11] (Eq. (21)) in terms of a parameter γ proportional to (ka)^{1/3}, with the wave number k = 2π/λ and the wavelength in vacuum λ = c/f.

IV. MODEL SIMULATION RESULT AND ANALYSIS

We take a birch with the circumference l = 1.1 m as an example; the frequency of the source point in the base station is f = 2.4 GHz, the radius of the standing tree is a = l/(2π), and the source and receiver are at equal height, namely the breast height z = 1.3 m. We employ an equiangular polygon with eighteen edges to approximate the stem form. Let the secondary source point located at (x', y', z') along the negative Y axis be at y' = -a, and the receiving point located at (x, y, z) along the positive Y axis be at y1 = a sin φ. The wedge angle is α = 150°, the relative dielectric constant is εr = 2.7275 and the conductivity σ ≈ 0.

Suppose Ei is the incident field. From (16), the diffraction field of the receiver satisfies E_n(s) = E_{n-1}(Q) D(φ_n) A(s) exp(-jks); then, considering an incident plane wave, the diffraction loss of the receiver power can be written in the form

    20 log | Ed / Ei |  (dB)    (22)

The variation of the diffraction loss with the polygon (wedge) angle is shown in Fig. 4, and the variation of the diffraction loss with the frequency of the source is shown in Fig. 5.

[Figure 4. The variation between the diffraction loss of the receiver power and the wedge angle]

Fig. 4 indicates an inverse relation between the diffraction loss of the receiver power and the wedge angle: the diffraction loss rapidly decreases with increasing wedge angle. When the wedge angle α > 150°, the receiving antenna in the deep diffraction region is near to the wedge plane, which agrees with expression (11). Fig. 5 shows that the diffraction loss of the receiver power decreases with the increasing frequency of the source, i.e. the attenuation grows because, as the frequency increases, the wavelength is shortened and the ability of the wave to diffract decreases.
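The chain of Eqs. (10)/(15), (16) and (22) can be put together in a few lines of code. The sketch below is illustrative only: it uses the simplified half-plane coefficient rather than the full four-term form of Eq. (17), and the frequency, distance s and diffraction angle are made-up values, not measurements from the paper.

#include <complex.h>
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Half-plane coefficient of Eq. (10)/(15):
   D(phi) = e^{-j pi/4} (1 + cos phi) / (2 sqrt(2 pi k) sin phi)           */
static double complex D_edge(double phi, double k)
{
    return cexp(-I * M_PI / 4.0) * (1.0 + cos(phi))
           / (2.0 * sqrt(2.0 * M_PI * k) * sin(phi));
}

int main(void)
{
    double f   = 2.4e9;                /* source frequency (assumed)           */
    double lam = 3.0e8 / f;            /* wavelength in vacuum, lambda = c/f   */
    double k   = 2.0 * M_PI / lam;     /* wave number k = 2 pi / lambda        */
    double s   = 10.0;                 /* distance from diffraction point Q (assumed) */
    double phi = 30.0 * M_PI / 180.0;  /* diffraction angle (assumed)          */

    /* Eq. (16): E_d = E_i * D(phi) * A(s) * exp(-jks), with A(s) = 1/sqrt(s)  */
    double complex Ei = 1.0;           /* unit incident plane wave             */
    double complex Ed = Ei * D_edge(phi, k) * (1.0 / sqrt(s)) * cexp(-I * k * s);

    /* Eq. (22): diffraction loss of the receiver power in dB                  */
    double loss_dB = 20.0 * log10(cabs(Ed) / cabs(Ei));
    printf("lambda = %.4f m, |D| = %.4e, loss = %.2f dB\n",
           lam, cabs(D_edge(phi, k)), loss_dB);
    return 0;
}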

[Figure 5. The variation between the diffraction loss of the receiver power and the frequency of the source]

[Figure 6. The experimental curve of diffraction of standing tree]

The field distribution of the birch measured by the Protek 3290N field strength apparatus is shown in Fig. 6. If the diffraction field fades by more than -12 dB, we consider that the blind region has been reached. The source frequency used is 2.405 GHz. The dots represent the measured locations of the sample points. From Fig. 6, using a linear approximation we have y = -0.4508x + 113.49, and thus the angle of the blind region is arctan(0.4508) = 4.27°. If we use the average error of 3% of the basal area at breast height, the blind-region angle can be written in terms of [k(1.03a)]^{1/3} and [k(0.97a)]^{1/3} (Eq. (20)). In our example of the birch, λ = (3×10^8)/(2.405×10^9) m = 1/8.02 m and a = 1.1/(2π) m; from Eq. (20) we obtain 25.6°-26.1°.

So the theoretical error is 5.2%-7%, within the error region, and does not influence the deployment of the sensor network.

V. CONCLUSION

In conclusion, we discuss the layout of the sensors for the measurement of standing trees and the problem of the diffraction loss and of the blind region of diffraction in that optimization. The physical model of diffraction on standing trees in a plantation is presented by means of the Fresnel-Kirchhoff principle and the uniform geometrical theory of diffraction (UTD). Moreover, the expression for the diffraction of a standing tree is derived, the behaviour of the diffraction loss is analyzed, and the algorithm for the blind region is discussed briefly. Finally, the feasibility and validity of UTD for computing the diffraction loss of standing trees in a plantation are verified by theoretical analysis and simulation. This approach can compute, analyze and evaluate the diffraction loss field for a standing tree and its blind region. The algorithm for the diffraction of standing trees can serve as a theoretical foundation for the optimized layout of the sensors, and introduces a novel technology for the telemetry of environmental information.

From the simulation results, the propagation loss follows the law of electromagnetic wave propagation. However, there are differences between theory and measurement due to the influence of vegetation, terrain, etc. Therefore, the model introduced in this paper needs to be further optimized. The method presented in this paper is only initial work; more work will be done.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of Zhejiang Province of China (Grant No. Y12C160023) and the Natural Science Foundation of China (Grant No. 30972425). Corresponding author: Yun-Jie Xu.

REFERENCES

[1] Deng Hongbing, Hao Zhanqing. Study on Height Growth Model of Pinus koraiensis. Chinese Journal of Ecology, 1999, 18(3), pp. 19-22.
[2] Barrio-Anta, Dieguez-Aranda, Ulises, Castedo-Dorado, Fernando, Alvarez-Gonzalez, Juan Gabriel. Mimicking natural variability in tree height of pine species using a stochastic height-diameter relationship. New Zealand Journal of Forestry Science, 2006, 36(1), pp. 21-34.
[3] Bardi, J. F., Villacampa, Y., Losardo, O., Borzone, H. A study of the relationship height-diameter. Advances in Ecological Sciences, Ecosystems and Sustainable Development III, 2001, (10), pp. 657-666.

[4] Li Wei-ming, Lv Xiao-de, Gao Ben-qing, Liu Rui-xiang. Ray Path Tracing on Convex Surface with Applications to the Geometrical Theory of Diffraction. 2000, 28(9), pp. 49-51.
[5] Wonn, H. T., O'Hara, K. L. Height:diameter ratios and stability relationships for four northern Rocky Mountain tree species. Western Journal of Applied Forestry, 2001, 16(2), pp. 87-94.
[6] Sharma, Mahadev. Height-diameter equations for boreal tree species in Ontario using a mixed-effects modeling approach. Forest Ecology and Management, 2007, 249(3), pp. 187-198.
[7] Zhang Yu-zhu, Cao Zhi-wei, Yan Dun-liang, Dai Yu-wei. Study on Quantitative Relations Between Tree Measuring Factors of Pinus sylvestris var. mongolica Plantation on Sand Land of Nenjiang River. Protection Forest Science and Technology, 2006, No. 01, pp. 7-9.
[8] Xian-Yu Meng. Forest Mensuration (Second Edition). China Forest Press, 1995.
[9] B. B. Baker and E. T. Copson, The Mathematical Theory of Huygens' Principle, 2nd ed., Oxford University Press, London, 1953.
[10] J. H. Whitteker, Fresnel-Kirchhoff theory applied to terrain diffraction problems, Radio Science, 1990, 25(5), pp. 837-851.
[11] Paul R. Rousseau and Prabhakar Pathak, A Time Domain Uniform Geometrical Theory of Slope Diffraction for a Curved Wedge, Turk J Elec Engin, 2002, Vol. 10(2), pp. 385-398.
[12] Mao-Guang Wang, Geometrical Theory of Diffraction, China Xidian University Press, Beijing, 1989.
[13] Xiong-Wen, Yi-Xi Xie, Diffraction over a Flat-Topped Terrain Obstacle with Bevel Edge, Chinese Journal of Electronics, 1995, Vol. 23(06), pp. 81-83.
[14] Youngchel Kim, On a Uniform Geometrical Theory of Diffraction based Complex Source Beam Diffraction by a Curved Wedge with Applications to Reflector Antenna Analysis, Ph.D. dissertation, The Ohio State University, 2009.
[15] Kouyoumjian, R. G. and Pathak, P. H., A uniform geometrical theory of diffraction for an edge in a perfectly conducting surface, Proc. IEEE, 1974, pp. 1448-1461.
[16] Yun-Jie Xu, Wen-Bin Li, Strength Prediction of Propagation Loss in Forest Based on Genetic-SVM Classifier, 2010 Second International Conference on Future Computer and Communication, vol. 03, pp. 251-254, September 2010.
[17] Henry L. Bertoni, Radio Propagation for Modern Wireless Systems, Publishing House of Electronics Industry, Beijing, 2002.

Yun-jie Xu received B.S., M.S. and Ph.D. degrees in forest engineering from Beijing Forestry University, Beijing, China, in 1998, 2004, and 2009, respectively. From April 2004 to Dec. 2010, he was a faculty member with the School of Technology, Zhejiang Agricultural & Forestry University, and was promoted to lecturer in 2005. His current research interests include system fault diagnosis and signal propagation in forests.

Wen-bin Li received M.S. and Ph.D. degrees from Shizuoka University and Ehime University, Japan, in 1987 and 1992, respectively; he is a Ph.D. advisor. From 1992 to Dec. 2011, he was a faculty member with the School of Technology, Beijing Forestry University, and was promoted to professor in 2000. His current research interests include forestry machinery automation and intelligence.

Biddy - a multi-platform academic BDD package
Robert Meolic
Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia
Email: meolic@uni-mb.si

Abstract - Biddy is a BDD package under GPL, developed at the University of Maribor. It uses ROBDDs with complement edges, as described in the paper K. S. Brace, R. L. Rudell, R. E. Bryant, Efficient Implementation of a BDD Package, 1990. Compared to other available BDD packages, Biddy's most distinguishing features are its specially designed C interface and an original implementation of automatic garbage collection. More generally, the Biddy project is not only concerned with the computer library, but also offers a demo application for the visualization of BDDs, called BDD Scout. The whole project is oriented towards a readable and comprehensible source code in C, which can be compiled unchanged on different platforms, including GNU/Linux and MS Windows.

Index Terms - Boolean algebra, binary decision diagram, symbolic manipulation of Boolean functions, formal methods, free software

I. INTRODUCTION

Boolean algebra is a mathematical structure applied within many engineering and scientific fields, especially those concerned with electronics, computers, and communications. The Binary Decision Diagram (BDD) is a data structure for representing Boolean functions. This representation has gained popularity because it is canonical, and thus tautology checking, satisfiability checking, and equivalence checking can be done in constant time (after the BDD has been created). Moreover, it is a compact representation of many of those Boolean functions that arise in practical problems.

Binary decision diagrams are not just another theory. Many applications are heavily based on Boolean algebra and BDDs. Some successful examples are hardware design methods, e.g. logic synthesis [1], formal methods concerned with testing and verifying systems, e.g. symbolic model checking [2], and methods for knowledge representation and discovery, e.g. rough-set theory [3]. Recently, D. E. Knuth included an extensive section about BDDs in his famous monograph The Art of Computer Programming [4], where he states that "(BDDs) burst on the scene in 1986, long after old-timers like me thought that we had already seen all of the basic data structures that would ever prove to be of extraspecial importance" and that "(BDDs) have given me many more surprises than anything else so far". And last, but not least, a pioneering paper on BDD algorithms [5] is one of the most cited papers in the history of computer science!

The BDD package is computer software, more precisely a sort of mathematical library, which allows other programs to create and manipulate Boolean functions by using BDDs. Many different BDD packages are available, usually as a piece of free software. Among others, Wikipedia [6] lists ABCD [7], BuDDy [8], CAL [9], CMU BDD [10], CUDD [11], JDD [12], and Biddy [13], the package that this paper is about. Biddy is a minimalistic BDD package that includes only the necessary functions. It uses ROBDDs with complement edges, as described in [14]. Biddy can be distinguished mostly by its specially designed C interface and an original garbage collection that is not based on a classic reference count.

Biddy is based on a BDD package written at the University of Maribor in 1992 [15][16]. Hence, it can be categorized as one of the oldest BDD packages around (not to brag, but even D. E. Knuth remembers he "didn't actually learn about binary decision diagrams until 1995 or so" [4]). In Maribor, the original BDD package was written in Pascal and ran on a VAX 4000-600. The history of its further development can be summarized as follows. Meolic included it into EST, a tool for the formal verification of systems [17]. In 2006, the name Biddy first appeared. A separate library was formed in 2007. So far Biddy has been developed as academic software (as defined in [18]), thus a clean implementation has been preferred over efficiency optimization.

In 2003, whilst being a part of EST, Biddy was included within a survey of 13 BDD packages [19] and was one of two awarded mark A for code quality. No matter how subjective this classification is (the other A-graded package is the one maintained by the author of the survey :-), it reflects the main goal of Biddy: to promote a readable and comprehensible source code. The orientation towards educational purposes is also supported by the licence, which is GPL (published by FSF [20]), and by the ability to be compiled and used on different platforms, including GNU/Linux and MS Windows.

Furthermore, this paper is organized as follows. Section II provides some basic terms and definitions about BDDs. In Section III, Biddy is briefly described from the user's point of view. Section IV gives details about the implementation. Section V introduces BDD Scout, a demo application for the visualization of BDDs which is being developed as part of the Biddy project. The conclusion summarizes the current state of the project.

II. BINARY DECISION DIAGRAMS

The term Binary Decision Diagram was coined by S. B. Akers in 1978 [21]. In his paper Akers predicted various applications for BDDs but did not describe any useful implementation. His work was extended in 1986 by R. E. Bryant, who introduced Ordered BDDs and also gave computer algorithms for their manipulation [5]. In 1990, Reduced Ordered BDDs were invented together with efficient recursive algorithms [14][22], and afterwards the activities involving BDDs quickly became widespread [23][24][25][26][4].

A Binary Decision Diagram is a directed, acyclic graph with one root. Its leaves are called terminal nodes. All other nodes are called non-terminal nodes (also branch nodes) and are labelled by variables; every non-terminal node has two outgoing edges. Boolean functions are associated with edges. A Boolean function represented by an edge is recursively calculated as

    F = (NOT v)*E + v*T = ITE(v, T, E)    (1)

where v is the variable in the root (also called the top variable), E is the Boolean function represented by the root's 'else' edge, and T is the Boolean function represented by the root's 'then' edge.

An Ordered Binary Decision Diagram (OBDD) is a BDD where variables occur along every path from the root to a leaf in strictly ascending order, with regard to a fixed ordering. Algorithms appear to be much more efficient if they can assume the same variable order for all the involved OBDDs. Moreover, the size of an OBDD heavily depends on its variable order. An OBDD is a Reduced Ordered Binary Decision Diagram (ROBDD) if it contains neither isomorphic subgraphs nor nodes with isomorphic descendants. The most important property of a ROBDD is the canonicity of the representation. The other one is node sharing (which requires the same ordering of all ROBDDs), i.e. when more than one Boolean function is simultaneously represented, the merging of isomorphic subgraphs is applied between all of them. Hence, every node can belong to more functions.

An important extension of ROBDDs is the introduction of complemented edges. Every edge has an additional field (a single bit is sufficient), which is used to distinguish between regular and complemented edges. A complemented edge complements the represented Boolean function. In this way, there is no need to keep a terminal node for 0, because it can be represented by a complemented edge to the terminal node 1. The usual way to maintain canonicity is to keep the 'then' edge regular in every node. Examples of a ROBDD with and without complemented edges are given in Fig. 1.

[Figure 1. ROBDD without complemented edges and ROBDD with complemented edges for the same Boolean function]

III. MANIPULATION OF BOOLEAN FUNCTIONS USING BIDDY LIBRARY

Biddy is written in C. The source code is purified and improved in such a way that it can be compiled with a C or C++ compiler without errors and warnings. Precompiled packages include a dynamically linked library (i.e. *.so on GNU/Linux, *.dll on MS Windows, *.dylib on Mac OS X) and the appropriate C header biddy.h (the header is included in the development version only). Currently, there are no interfaces for other programming languages. The supplied C header is quite small because of nice abstractions. Its data declaration part is given in Fig. 2 (this is real code, only some comments have been removed).

/* Constant definitions */
#define FALSE 0
#define TRUE !FALSE

/* Macro definitions */
#define Biddy_isEqv(f,g) \
  (((f).p == (g).p) && \
   ((f).mark == (g).mark))
#define Biddy_isTerminal(f) \
  ((f).p == biddy_termTrue.p)
#define Biddy_isNull(f) \
  ((f).p == biddy_termNull.p)

/* Type declarations */
typedef char Biddy_Boolean;
typedef char *Biddy_String;
typedef int Biddy_Variable;
typedef void (*Biddy_VoidFunction)();

/* Structure declarations */
typedef struct Biddy_Edge {
  void *p;
  Biddy_Boolean mark;
} Biddy_Edge;

/* Variable declarations */
EXTERN Biddy_Edge biddy_one;
EXTERN Biddy_Edge biddy_zero;
EXTERN Biddy_Edge biddy_null;

Figure 2. Data declaration part of biddy.h
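To make Eq. (1) and the effect of the complement mark concrete, here is a small stand-alone sketch that evaluates the function represented by an edge under a variable assignment. It deliberately does not use the Biddy API; the node layout and the names are invented for illustration only.

#include <stdbool.h>
#include <stddef.h>

/* A toy ROBDD node and edge with a complement mark (illustrative only). */
struct Node;
typedef struct { struct Node *p; bool mark; } Edge;   /* mark = complemented edge */
typedef struct Node { int v; Edge t, e; } Node;       /* v < 0 marks terminal 1   */

/* Eq. (1): F = (NOT v)*E + v*T; a complement mark negates the represented function. */
static bool eval(Edge f, const bool assignment[])
{
    bool r;
    if (f.p->v < 0) r = true;                          /* the only terminal is 1  */
    else r = eval(assignment[f.p->v] ? f.p->t : f.p->e, assignment);
    return f.mark ? !r : r;
}

int main(void)
{
    Node one = { -1, { 0 }, { 0 } };                   /* terminal node 1          */
    /* f(x0, x1) = x0 AND x1, built by hand:
       node for x1 -> (then: 1, else: complemented 1), node for x0 -> (then: x1, else: complemented 1) */
    Node n1 = { 1, { &one, false }, { &one, true } };
    Node n0 = { 0, { &n1,  false }, { &one, true } };
    Edge f  = { &n0, false };

    bool a[2] = { true, true };
    return eval(f, a) ? 0 : 1;                         /* returns 0 for x0 = x1 = 1 */
}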

Constant and macro definitions are self-explanatory and so are the types Biddy_Boolean and Biddy_String. Biddy_Variable is a type used for referring to variables in a BDD. The variables are stored in a table and referenced by indices. For the user it is unimportant how the variables are stored; they could, for example, also be stored in a binary tree and referenced by pointers (assuming that pointers are compatible with integers). One BDD variable is created during the initialization of the Biddy package (performed by function Biddy_Init), whilst all other BDD variables must be created explicitly with function Biddy_FoaTerminal. Type Biddy_VoidFunction is used to declare functions that extend Biddy's capability of memory management. These special functions are started each time Biddy tries to free redundant memory, and are intended to delete invalid entries in user caches.

The most important part of Biddy's external header is its declaration of structure Biddy_Edge. Each edge in a BDD corresponds to one Boolean function. Biddy_Edge consists of a pointer to a node and an optional mark. A void pointer is used to achieve separation of interface and implementation. Although this is not strict encapsulation (C does not have such mechanisms as, for example, C++), the user is encouraged to use the provided API functions only, and not to travel through the BDD by direct usage of pointers nor rely on the node's internal structure. Among others, Biddy_GetThen, Biddy_GetElse, and Biddy_GetVariable are functions available in the API.

The variables are defined by the keyword EXTERN, which is in fact a macro defined before the data declaration part. On GNU/Linux and Mac OS X systems, this macro is simply expanded to the keyword extern, but on MS Windows it is expanded to the code required for the declaration of external symbols in DLL files (moreover, the expansion differs depending on whether you build a DLL or just use it). The edges biddy_one and biddy_zero represent the Boolean functions 1 and 0, respectively, whilst biddy_null represents a non-valid (null) edge.

Biddy is capable of all the typical operations regarding Boolean functions.
- Tautology checking. For the given Boolean function G check if G = 1. Macro Biddy_isEqv is suitable for performing this operation.
- Equivalence checking. For the given Boolean functions G1 and G2 check if G1 = G2. Again, macro Biddy_isEqv is suitable for performing this operation.
- Complement. For the given Boolean function G calculate NOT G. Function Biddy_NOT can be used to calculate the complement.
- Binary operations (AND, OR, etc.). For the given Boolean functions G1 and G2 and the given binary operation, calculate G1 op G2. Function Biddy_ITE is capable of calculating all binary operations on Boolean functions.
- Restriction. For the given Boolean function G, given variable x, and given constant c in {0,1}, calculate G|x=c. Function Biddy_Restrict calculates restriction.
- Composition. For the given Boolean functions G and H and given variable x, calculate G|x=H. Function Biddy_Compose calculates composition.
- Existential and universal quantification. For the given Boolean function G and given variable x, calculate ∃x.G and ∀x.G. Functions Biddy_E and Biddy_A calculate the existential and universal quantifications, respectively.

As an example, a simple program using the Biddy library is given in Fig. 3. It calculates the 13-th minterm of the Boolean function F(a, b, c, d). On an Ubuntu (GNU/Linux) system, where the Biddy library was installed from the available deb package, this program is compiled with the following command:

gcc -DUNIX -o mint13 mint13.c -lbiddy

On MS Windows (either the 32 or 64 bit version) you have to use:

gcc -DWIN32 -DUSE_BIDDY -o mint13.exe mint13.c -lbiddy

#include <biddy.h>

int createMinterm13() {
  Biddy_Edge a,b,c,d,TMP1,TMP2,F;

  a = Biddy_FoaTerminal("a");
  b = Biddy_FoaTerminal("b");
  c = Biddy_FoaTerminal("c");
  d = Biddy_FoaTerminal("d");
  TMP1 = Biddy_ITE(a,b,biddy_zero);
  TMP2 = Biddy_ITE(
           Biddy_NOT(c),d,biddy_zero);
  F = Biddy_ITE(TMP1,TMP2,biddy_zero);
  printf("F has %d nodes.\n",
         Biddy_NodeNumber(F));
}

int main() {
  Biddy_About();
  Biddy_Init();
  createMinterm13();
  Biddy_Exit();
  return 0;
}

Figure 3. A simple program using Biddy library

IV. IMPLEMENTATION DETAILS

Biddy sources consist of the files biddy.h, biddyInt.h, biddyMain.c, biddyStat.c, and those used for compiling and packaging. File biddyMain.c includes definitions of constants, variables and functions, except those used for statistics; these have been separated and put into file biddyStat.c.

The coding style, i.e. the naming of functions, macros, and variables, the style of documentation, the organization and naming of source files etc., respects the conventions from [27] (these conventions originate in J. Ousterhout's Tcl/Tk Eng. Manual):
- exported functions, macros, types, and structures have a prefix Biddy_, e.g. Biddy_SimpleFunction; they are all defined in file biddy.h (also called the external header),
- internal functions, macros, types, and structures have a prefix Biddy, e.g. BiddySimpleFunction; they are all defined in file biddyInt.h (also called the internal header),
- local functions, macros, types, and structures (visible to one file only) do not have a prefix, e.g. SimpleFunction,
- exported variables have a prefix biddy_, e.g. biddy_simpleVariable,
- internal variables have a prefix biddy, e.g. biddySimpleVariable,
- local variables (visible to one function or one file only) do not have a prefix, e.g. simpleVariable.

The main data structures are the Node Table, ITE Cache, EAX Cache, and some special lists for memory management. They are all declared in file biddyInt.h, as given in Fig. 4.

/* NODE TABLE: a hash table with chaining */
typedef struct BiddyNode {
  struct BiddyNode *prev, *next, *list;
  Biddy_Variable v;
  Biddy_Edge f, t;
  int count;
} BiddyNode;
typedef struct {
  BiddyNode **table;
  BiddyNode **blocktable;
  int size;
  int generated, blocknumber, max, num,
      numf, foa, compare, add, garbage;
} BiddyNodeTable;

/* VARIABLE TABLE: dynam. allocated table */
typedef struct {
  Biddy_String name;
  int order;
  Biddy_Edge term;
  Biddy_Boolean value;
} BiddyVariable;
typedef struct {
  BiddyVariable *table;
  int size;
} BiddyVariableTable;

/* CACHE LIST: unidirectional list */
typedef struct BiddyCacheList {
  struct BiddyCacheList *next;
  Biddy_UserFunction gc;
} BiddyCacheList;

/* ITE CACHE: a fixed-size hash table */
typedef struct {
  BiddyNode *f, *g, *h;
  Biddy_Edge result;
  Biddy_Boolean hmark;
  Biddy_Boolean ok;
} BiddyIteCache;
typedef struct {
  BiddyIteCache *table;
  int size;
  int search, find, overwrite;
} BiddyIteCacheTable;

/* EAX CACHE: a fixed-size hash table */
typedef struct {
  BiddyNode *f;
  BiddyVariable v;
  Biddy_Edge result;
  Biddy_Boolean fmark;
  Biddy_Boolean ok;
} BiddyEAxCache;
typedef struct {
  BiddyEAxCache *table;
  int size;
  int search, find, overwrite;
} BiddyEAxCacheTable;

Figure 4. Structure declaration part of biddyInt.h

The Node Table is a hash table with chaining. It stores all nodes and ensures a quick search for a node with a given variable and references to descendants. It also prevents multiple instances of nodes with the same variable and references to descendants. Adding and searching for nodes are both done by function Biddy_FoaNode using the Find-Or-Add principle. The variable and the descendants are stored in the elements v, f, and t of the structure BiddyNode, respectively. The chains of nodes are bi-directional (elements prev and next) to enable fast removal of directly addressed nodes. If a particular chain is empty, the hash table contains a null pointer. Otherwise, it contains a pointer to the first node in the chain. The first element of each chain has its prev element pointing into the hash table (a pointer, but not one to a regular node) to enable correct relinking when it is being deleted. Moreover, it is assumed that prev and next are the first two elements in the structure BiddyNode, respectively, and that the hash function used to spread the nodes across the table never returns zero.

The ITE Cache and EAX Cache are fixed-size hash tables. The ITE Cache stores the arguments and results of the performed ITE operations. All binary operations are implemented via the ITE operation, as shown in Fig. 5. This is efficient not only because one algorithm is sufficient for all operations but also because, in this way, all operations share the same cache. In order to avoid distinguishing the same calls (e.g. ITE(f, g, 0) = ITE(g, f, 0)), the arguments given to the ITE operation are transformed into a predefined normal form before being stored in the ITE Cache. This normalization is performed according to the rules given in [14]. The algorithm for the operation ITE is shown in Fig. 6. The EAX Cache stores the parameters and results of the performed existential and universal quantifications. One cache is efficiently used to store the results of both operations (since ∀a.f = NOT ∃a.(NOT f)). Indeed, any cache is limited, and if a certain result has been deleted (entries are in fact overwritten), then the calculation has to be performed again.
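The Find-Or-Add principle described above can be sketched as follows. This is not the actual Biddy_FoaNode: the node layout, the table size and the hash function are simplified assumptions, and the sketch omits complement marks, the back-pointers and all garbage-collection bookkeeping.

#include <stdlib.h>

typedef struct SketchNode {
    struct SketchNode *next;          /* chaining inside one bucket            */
    int v;                            /* variable                              */
    struct SketchNode *f, *t;         /* 'else' and 'then' successors          */
} SketchNode;

#define TABLE_SIZE 1048573            /* illustrative table size (prime)       */
static SketchNode *table[TABLE_SIZE];

/* Simplified hash over (v, f, t); the real package uses its own function. */
static unsigned hash(int v, SketchNode *f, SketchNode *t)
{
    unsigned long h = (unsigned long)v * 31u
                    + (unsigned long)(size_t)f * 17u
                    + (unsigned long)(size_t)t;
    return (unsigned)(h % TABLE_SIZE);
}

/* Find-Or-Add: return the existing node with the same (v, f, t) or create it,
   so that two instances of the same node never exist in the table.           */
SketchNode *foaNode(int v, SketchNode *f, SketchNode *t)
{
    unsigned i = hash(v, f, t);
    SketchNode *n;
    for (n = table[i]; n != NULL; n = n->next)         /* find ...             */
        if (n->v == v && n->f == f && n->t == t) return n;
    n = malloc(sizeof *n);                             /* ... or add           */
    if (n == NULL) return NULL;
    n->v = v; n->f = f; n->t = t;
    n->next = table[i];
    table[i] = n;
    return n;
}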

Truth table | Symbol   | Boolean base          | ITE base
0000        | 0        | 0                     | 0
0001        | f AND g  | f*g                   | ITE(f,g,0)
0010        | f > g    | f*(NOT g)             | ITE(g,0,f)
0011        | f        | f                     | f
0100        | f < g    | (NOT f)*g             | ITE(f,0,g)
0101        | g        | g                     | g
0110        | f XOR g  | f*(NOT g)+(NOT f)*g   | ITE(f,NOT g,g)
0111        | f OR g   | f+g                   | ITE(f,1,g)
1000        | f NOR g  | (NOT f)*(NOT g)       | ITE(f,0,NOT g)
1001        | f XNOR g | f*g+(NOT f)*(NOT g)   | ITE(f,g,NOT g)
1010        | NOT g    | NOT g                 | ITE(g,0,1)
1011        | f >= g   | f+(NOT g)             | ITE(g,f,1)
1100        | NOT f    | NOT f                 | ITE(f,0,1)
1101        | f <= g   | (NOT f)+g             | ITE(f,g,1)
1110        | f NAND g | (NOT f)+(NOT g)       | ITE(f,NOT g,1)
1111        | 1        | 1                     | 1

Figure 5. Binary operations on Boolean functions

Biddy_ITE(F,G,H) {
  normalization of arguments F,G,H
  if simple call return result
  if result in ITE Cache return result
  v = the smallest top variable of F,G,H
  T = Biddy_ITE(F,G,H) restricted to v=1
  E = Biddy_ITE(F,G,H) restricted to v=0
  result = Biddy_FoaNode(v,T,E)
  store F,G,H, and result into ITE Cache
  return result
}

Figure 6. Algorithm for ITE

The efficiency of a BDD package depends heavily on the adequacy of the memory management, i.e. on deleting nodes which were created during previous calculations but are no longer needed for further calculations. This operation is called garbage collection. For a deleted node, any record in any cache referencing this node (either as argument or result) is invalid, and may be reused. Biddy has a Cache List that keeps references to the functions intended for marking invalid entries in caches. The ITE Cache and the EAX Cache are registered on the Cache List during initialization, whilst any user-defined cache must be registered using function Biddy_AddCache.

The main problem with garbage collection is the detection of unnecessary nodes. Some help from the user is required because, during different calculations, a lot of temporary nodes are created and it is hard for the system to automatically guess which results will be used during further calculations. It could be assumed that all results will be needed in the future, but this permissive strategy is, in fact, very poor for the majority of practical applications.

Biddy implements an original method called garbage collection with a formulae counter, not used by other BDD packages. It works via an internal global variable biddyCount and an element count in every node. If a node's count is equal to zero then this is a fortified node, if it is equal to the value of biddyCount then this is a fresh node, and otherwise it is a bad node.
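The classification into fortified, fresh and bad nodes is a direct comparison against the counter. The following fragment only restates that rule as code; the names mirror the description above but are not taken verbatim from the Biddy sources.

/* Global counter incremented by Biddy_IncCounter (sketch). */
static int biddyCount = 1;

typedef enum { NODE_FORTIFIED, NODE_FRESH, NODE_BAD } NodeStatus;

typedef struct { int count; /* other node fields omitted */ } CountedNode;

/* count == 0          -> fortified (kept across calculations)
   count == biddyCount -> fresh     (created during the current calculation)
   otherwise           -> bad       (safe to delete at any moment)           */
static NodeStatus status(const CountedNode *n)
{
    if (n->count == 0) return NODE_FORTIFIED;
    if (n->count == biddyCount) return NODE_FRESH;
    return NODE_BAD;
}

/* The check performed by Biddy_isOK: a node is usable iff it is not bad. */
static int isOK(const CountedNode *n)
{
    return status(n) != NODE_BAD;
}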

Fortified nodes belong to those results from previous calculations which are assumed (by the user) to be needed in the future. Fresh nodes have been created during the current calculation and must remain, at least until the end of this calculation. Bad nodes are safe to delete at any moment. Indeed, all descendants of a fortified node should be fortified, and the descendants of a fresh node must not be bad. Biddy offers function Biddy_isOK to check whether a particular node is not bad.

When a new node is created, it becomes a fresh node. Moreover, all its descendants are refreshed, i.e. they are changed into fresh nodes if they are currently bad. Refreshing is done by the recursive function Biddy_Fresh. When the calculation is finished, the user can fortify its result, i.e. change all its nodes into fortified ones. Fortifying is done by the recursive function Biddy_Fortify. At any time, the user can change all the fresh nodes into bad nodes by calling Biddy_IncCounter, which simply increments the variable biddyCount. This is usually used to separate different calculations, i.e. to mark all redundant nodes from the already finished calculations.

Function Biddy_IncCounter, or any other function mentioned in the previous description, does not actually start garbage collection, i.e. node deleting. This is started periodically by the system or explicitly by calling function Biddy_Garbage. Deleting nodes as often as possible is not the best strategy: a currently unnecessary node may become needed by the very next operation and would have to be recreated. Hence, Biddy uses the following approach. An amount (a memory block) of BDD nodes is created during initialization. When all these nodes have been used, garbage collection starts to delete unnecessary ones. A new memory block full of BDD nodes is created if none of the nodes can be deleted. References to the allocated memory blocks are stored in the Memory Block Table (element blocktable in structure BiddyNodeTable).

A special multi-purpose pointer list is used as part of every node in order to support efficient memory management. When a block of new BDD nodes is created, all the nodes are linked by this pointer into the List of Free Nodes (its beginning is referenced by pointer biddyFreeNodes). Nodes deleted by garbage collection are not deallocated from memory, they are just returned to this list. In this way, the number of time-consuming requests for allocating and deallocating memory is greatly reduced. When a new node is needed, it is simply taken from the List of Free Nodes. For the nodes in use, the pointer list is reused to link the node to another list called the List of New Nodes (its beginning and end are referenced by the pointers biddyFirstNewNode and biddyLastNewNode, respectively). The List of New Nodes greatly reduces the time for garbage collection because only nodes from this list are checked and not the whole Node Table. It is even unnecessary to look over the whole list, as from a particular node forward to the end the list contains only fresh nodes. This node is referenced by the pointer biddyFreshNodes, and this reference is transferred to the end of the list each time biddyCount is incremented. During garbage collection all fortified nodes are removed from the List of New Nodes and, hence, the next call will not be bothered with them. Indeed, they must be returned to the list if their status is changed from fortified to fresh. For all fortified nodes removed from the List of New Nodes, the pointer list is reused once again to connect them to the List of Fortified Nodes, which allows them to be visited successively. The algorithm for garbage collection with a formulae counter, as used in Biddy, is shown in Fig. 8.

GarbageCollection () {
  if some bad nodes could exist {
    do all functions from Cache List
    forall nodes in List of New Nodes {
      if node is fortified
        move to List of Fortified Nodes
      if node is fresh
        do nothing
      if node is bad {
        delete it from Node Table
        move to List of Free Nodes
      }
    }
  }
}

Figure 8. Garbage collection in Biddy

If the status of the nodes is not controlled by the user, all nodes will remain fresh forever and the garbage collection simply does nothing. There is nothing wrong with this (for small examples this could even be the most efficient method), but you will be unable to manipulate large Boolean functions in such a way. A simple example of controlling garbage collection is given in Fig. 7. It shows the calculation of function F2, which requires the calculation of a temporary function F1. After the construction of F2, function F1 is no longer needed, whilst function F2 itself is supposed to be a useful result. Thus, it is fortified and can be used during the calculation of F3. Function F1 must not be used during the calculation of F3 because garbage collection may delete some nodes of F1 before the calculation of F3 is finished.

F1 = create a BDD
F2 = create a BDD using F1
Biddy_Fortify(F2);
Biddy_IncCounter();
F3 = create a BDD using F2
Biddy_Fortify(F3);
Biddy_IncCounter();

Figure 7. Simple example of controlling garbage collection

Another, more complex example is shown in Fig. 9. Here, function G is computed iteratively. The temporary functions F1 and F2 are created during each step. The final G may be huge, therefore we allow the deletion of unnecessary nodes as the calculation goes along. We do not need the intermediate results for G, only the final one, and therefore we do not use Biddy_Fortify during the calculation.

G = biddy_zero;
while (some condition) {
  Biddy_IncCounter();
  Biddy_Fresh(G);
  F1 = create a BDD using G
  F2 = create a BDD using G
  G = create a BDD using F1 and F2
}
Biddy_Fortify(G);
Biddy_IncCounter();

Figure 9. Another example of controlling garbage collection

V. BDD SCOUT

BDD Scout is a tool for the visualization of BDDs (see the screenshot given in Fig. 10). It serves as an example application demonstrating the capabilities of Biddy. There exist other free BDD visualization tools, for example BDDTCL [28], BDD Visualizer [29], and JADE [30]. BDD Scout has been developed completely independently of them, and although the current version is really more a demo than a final product, it already includes comparable or even innovative functions.

BDDTCL was an early bird. It has not been updated for a long time and its capabilities are (according to the available screenshot) similar to BDD Scout. BDD Visualizer is a web-based application. It generates PDF documents. It is not as flexible as BDD Scout, and the user can neither adjust a generated graph nor interactively explore it. JADE is the most sophisticated software of this group. It is implemented in Java and produces nice graphics. It allows for the study of different variable ordering algorithms and offers good navigation possibilities, but other than this there is nothing spectacular. Being subjective, the output graphs of BDD Scout can sometimes be even more suitable for publications.

BDD Scout consists of two parts.
- The calculation part is written in C and consists of creating, importing, and exporting BDDs, calculating different statistics, and performing benchmarks.
- The GUI and the drawing part are extensions of a separately developed Tcl/Tk application, bddview, and allow the user to load and save a graph, adjust the graphical representation, and create a PNG image and a PDF file (by utilizing ghostscript).

The calculation part currently includes a parser for a simple recursive BDD representation and a parser for the prefix form of Boolean functions used in IFIP/ISCAS benchmarks. For the time being, only one benchmark is implemented (the code is maintained in the separate files bddscoutIFIP.c and bddscoutIFIP.tcl).

Figure 10. Screenshot of BDD Scout v1.0 (arrows off, grid on, part of the graph is being selected)

A recursive BDD representation recognized by BDD Scout has the following rules:
- The first word is the name of the BDD.
- The second word is the name of the top variable.
- Any variable name is followed by a description of two subgraphs given in parentheses.
- The symbol * is used to denote complemented edges.
- Spacing and indentation are unimportant.
An example of a recursive BDD representation is given in Fig. 11. The obtained graph is given in Fig. 14.

Biddy
B (* i (d (1) (y (* 1) (1)))
     (d (y (* 1) (1)) (1)))
  (  i (d (1) (y (* 1) (1)))
     (d (y (* 1) (1)) (1)))

Figure 11. A recursive BDD representation supported by BDD Scout

The prefix form of Boolean functions (as used in IFIP/ISCAS benchmarks) has the following rules:
- The file optionally starts with the set of variables given in parentheses (to determine the variable ordering).
- There can be many Boolean functions within the same file.
- Spacing and indentation are unimportant, but the function's name and the symbol = must be given on the same line.
- Supported operators (and also reserved words) are NOT, OR, AND, and EXOR, written either uppercase or lowercase.
An example of the prefix form of Boolean functions is given in Fig. 12. The obtained graph is given in Fig. 14 (it is the same as in the previous example).

(B i d y)
s1 = (or B (not y))
s2 = (or B i d)
s3 = (or B (not i) (not d))
s4 = (or (not B) i (not d) y)
s5 = (or (not B) (not i) d y)
Biddy = (and s1 s2 s3 s4 s5)

Figure 12. A prefix representation of Boolean functions supported by BDD Scout

Application bddview, which is used in the drawing part of BDD Scout, is a single Tcl/Tk script. It is a graph viewer only and does not directly use Biddy or any other BDD package. However, it is not a general graph viewer; it is in many ways optimized to visualize ROBDDs with complemented edges. Internally, bddview uses a special textual representation that contains the exact coordinates of all nodes. In order to show the graph for a particular BDD, BDD Scout first produces an appropriate description in the bddview format. The program dot from the graphviz package [31] is utilized to determine the position of each node.


The bddview format consists of the following constructs:

label <n> <name> <x> <y>
node <n> <name> <x> <y>
terminal <n> 1 <x> <y>
connect <n1> <n2> <type>

Here, <n> is a unique number (an integer), <name> is a string, <x> and <y> are coordinates (integers), and <type> is a label describing the kind of line (for example, s, l, r, and d are used in Fig. 13; a small generator for this format is sketched below). The following remarks should be considered:
- (0,0) is the top left-hand corner,
- only one label is supported,
- a single line and an inverted single line should be used to connect a label and a node, only,
- a line to the right successor cannot be inverted,
- when using a double line, the line to the left successor is automatically inverted.
An example of the bddview format is given in Fig. 13, and the obtained graph is (again) given in Fig. 14.

terminal 4 1 50 240
terminal 6 1 100 290
terminal 8 1 150 240
connect 0 1 s
connect 1 2 d
connect 2 3 l
connect 2 7 r
connect 3 4 l
connect 3 5 r
connect 5 6 d
connect 7 5 l
connect 7 8 r

Figure 13. An example of bddview format
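As an illustration of the constructs just listed, the following short Python sketch writes a minimal bddview-style description for a one-variable BDD. The coordinates and the concrete type letters chosen here (s for the label connection, l and r for the left and right successors) are assumptions made for the example only; this is not output produced by BDD Scout itself.

def bddview_lines():
    """Emit a tiny, hand-made description in the bddview textual format."""
    lines = [
        "label 0 F 100 20",        # label <n> <name> <x> <y>
        "node 1 x 100 80",         # node  <n> <name> <x> <y>
        "terminal 2 1 60 160",     # terminal <n> 1 <x> <y>
        "terminal 3 1 140 160",
        "connect 0 1 s",           # label to top node (single line)
        "connect 1 2 l",           # left successor
        "connect 1 3 r",           # right successor (never inverted)
    ]
    return "\n".join(lines)

print(bddview_lines())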

VI. CONCLUSION

Biddy is a BDD package suitable for educational purposes and also usable in prototype research tools. It has already been used for quite some time in different projects. Hence, it is very unlikely that it contains major bugs. Nevertheless, it is still being actively developed and upgraded, whilst many other free BDD packages are no longer supported by their authors. Building and installation procedures are not ideal, as yet, but precompiled binary packages are being tested on various systems. Debian and rpm packaging has been added recently.

Figure 14. The graph described by the textual representations given in Fig. 11, Fig. 12, and Fig. 13

This paper has not compared the effectiveness of Biddy with other popular BDD packages. Considering the goals of the project so far, it was found that such a comparison would be unsuitable, as yet (which does not mean that Biddy is much slower than others). Biddy uses the classical depth-first approach. It also uses common data structures (node table, cache tables), which are implemented straightforwardly without tricky shortcuts. A simple implementation style has been used intentionally to improve the readability of the source code. Some optimizations are planned for the future. In addition to this work, the package will soon be extended with different algorithms for reordering.

The most original part of Biddy is its implementation of garbage collection. This method, using a formulae counter, is still being investigated and probably even more advantages will be seen. Memory management is, of course, the main factor of any powerful BDD package and, hence, it will get a lot of attention during the ongoing research.

It can simply be used to illustrate an interesting programming paradigm. On the other hand, it can be used to explore the axioms and theorems of Boolean algebra (by the equivalence checking of Boolean formulae). Its more obvious usage is to help students understand the details of the BDD package. And last, but not least, it can be used as an engine for research applications such as, for example, a demo implementation of Quine-McCluskey minimization [32] or very real-world formal verification of systems, e.g. stuck-at fault detection [33]. Moreover, BDD Scout can be extended in order to show how a BDD is constructed step by step, how the BDD representation changes if a different variable ordering is selected, to show the content of the cache, to demonstrate how the cache hit rate affects the speed of BDD computation, how garbage collection is triggered, etc.


REFERENCES

[1] S. N. Yanushkevich, D. M. Miller, V. P. Shmerko, R. S. Stankovic. Decision diagram techniques for micro- and nanoelectronic design handbook. CRC Press, 2006.
[2] C. Baier, J.-P. Katoen. Principles of Model Checking. The MIT Press, 2008.
[3] Q. Wei, T. Gu, "Symbolic Representation for Rough Set Attribute Reduction Using Ordered Binary Decision Diagrams", Journal of Software, Vol. 6, No. 6, 2011, pp. 977-984.
[4] D. E. Knuth. Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. Addison-Wesley Professional, 2009.
[5] R. E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Transactions on Computers, Vol. C-35, No. 8, 1986, pp. 677-691. Reprinted in M. Yoeli, Formal Verification of Hardware Design, IEEE Computer Society Press, 1990, pp. 253-267.
[6] Wikipedia: Binary decision diagram. On-line (21/10/2011). http://en.wikipedia.org/wiki/Binary_decision_diagram
[7] ABCD. On-line (21/10/2011). http://fmv.jku.at/abcd/
[8] BuDDy. On-line (21/10/2011). http://buddy.wiki.sourceforge.net/
[9] CAL. On-line (21/10/2011). http://embedded.eecs.berkeley.edu/Research/cal_bdd/
[10] CMU BDD. On-line (21/10/2011). http://www-2.cs.cmu.edu/~modelcheck/bdd.html
[11] CUDD. On-line (21/10/2011). http://vlsi.colorado.edu/~fabio/CUDD/
[12] JDD. On-line (21/10/2011). http://javaddlib.sourceforge.net/jdd/
[13] Biddy. On-line (21/10/2011). http://lms.uni-mb.si/biddy/
[14] K. S. Brace, R. L. Rudell, R. E. Bryant, "Efficient Implementation of a BDD Package", In: 27th ACM/IEEE Design Automation Conference (DAC'90), 1990, pp. 40-45.
[15] ... pp. 299-307.
[16] ... "... with ROBDDs", 1993. Presented at IEEE Region 8 Student Paper Contest, Paris-Evry 1993. Published in: IEEE Student paper contest: regional contest winners 1990-1997, IEEE, 2000.
[17] EST. On-line (21/10/2011). http://lms.uni-mb.si/EST/
[18] S. Paumier, "Why academic software should be Open Source", INFOtheca: Journal of informatics and librarianship, Vol. X, No. 1-2, 2009, pp. 51-54.
[19] G. Janssen, "A Consumer Report on BDD Packages", In: 16th Symposium on Integrated Circuits and Systems Design, 2003, p. 217.
[20] Free Software Foundation, Inc. On-line (21/10/2011). http://www.fsf.org/
[21] S. B. Akers, "Binary decision diagrams", IEEE Transactions on Computers, Vol. C-27, No. 6, 1978, pp. 509-516.
[22] S. Minato, N. Ishiura, S. Yajima, "Shared Binary Decision Diagram with Attributed Edges for Efficient Boolean Function Manipulation", In: 27th ACM/IEEE Design Automation Conference (DAC'90), 1990, pp. 52-57.
[23] R. E. Bryant, "Binary Decision Diagrams and Beyond: Enabling Technologies for Formal Verification", In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD '95), 1995, pp. 236-243.
[24] R. Drechsler, B. Becker. Binary decision diagrams: theory and implementation. Springer, 1998.
[25] C. Meinel, T. Theobald. Algorithms and Data Structures in VLSI-Design: OBDD Foundations and Applications. Springer-Verlag, 1998.
[26] R. Ebendt, G. Fey, R. Drechsler. Advanced BDD optimization. Springer, 2005.
[27] S. Edwards, G. Swamy, "The VIS Engineering Manual", 1996. On-line (21/10/2011). http://vlsi.colorado.edu/~vis/prgDoc.html
[28] BDDTCL. On-line (21/10/2011). http://www2.parc.com/csl/members/kpartrid/
[29] BDD Visualizer. On-line (21/10/2011). http://www.cs.uc.edu/~weaversa/BDD_Visualizer.html
[30] JADE: Implementation and Visualization of a BDD Package in JAVA. On-line (21/10/2011). http://www.informatik.uni-bremen.de/agra/eng/jade.php
[31] Graphviz. On-line (21/10/2011). http://www.graphviz.org/
[32] ... "Enhancing Quine-McCluskey", 2007. COMPASSS Working Paper WP 2007-49. http://www.compasss.org/pages/resources/wpfull.html
[33] ... "...ic model checking for sensing stuck-at faults in digital circuits", Inf. MIDEM, Vol. 32, No. 3, 2002, pp. 171-180 (in Slovene).

Robert Meolic received his Ph.D. from the University of Maribor, Slovenia, in 2005. He is currently an Assistant Professor at the Faculty of Electrical Engineering and Computer Science at the same university. His main research interests include Boolean algebra, binary decision diagrams, temporal logic and model checking. Dr. Meolic is a member of IEEE and the Slovenian Electronic Communication Society SIKOM.


Implementation of Multi-objective Evolutionary Algorithm for Task Scheduling in Heterogeneous Distributed Systems

Yuanlong Chen, Dong Li, Peijun Ma
Harbin Institute of Technology, Heilongjiang, China
E-mail: cyuanlong@126.com

Abstract: This paper presents an effective method for task scheduling in heterogeneous distributed systems. Its objective is to minimize the last task's finish time and to maximize the system reliability probability. The optimization is carried out through a non-domination sort genetic algorithm. The experimental results, based on both randomly generated graphs and the graphs of some real applications, showed that, when compared to two well known previous methods, the heterogeneous earliest finish time (HEFT) algorithm and the Critical Path Genetic Algorithm, this algorithm surpasses previous approaches in terms of both the last task's finish time and the system reliability probability.

Index Terms: DAG scheduling, task graphs, heterogeneous systems, non-domination sort genetic algorithm

I. INTRODUCTION

Software engineering plays an important role in control software, especially in safety-critical systems. Correct implementation of software ensures proper operation of these systems, and an excellent task scheduling strategy will reduce the probability of software errors in such systems.

With the recent advancements in massively parallel processing technologies, the problem of scheduling tasks in multiprocessor systems is becoming increasingly important. The problem of scheduling the task graph of a parallel program onto a parallel and distributed computing system is a well defined NP-complete problem. This problem involves mapping a Directed Acyclic Graph (DAG) for a collection of computational tasks and their data precedences onto parallel processing systems. Over the past few years, such systems have become the most attractive option for high performance computing and information processing. They have been increasingly employed for critical applications such as aircraft control, industrial process control, etc. With the increased commercialization of heterogeneous distributed computational systems, ensuring system reliability is of critical importance. Therefore, the goal of a task scheduler is to assign tasks to available processors such that the precedence requirements of these tasks are satisfied, the overall execution length (i.e., makespan) is minimized, and the reliability of the system is maximized.

Some scheduling algorithms have therefore been proposed to deal with heterogeneous systems, for example the mapping heuristic (MH) [1], the dynamic level scheduling (DLS) algorithm, the levelized min time (LMT) algorithm [2], the Critical-Path-on-a-Processor (CPOP) algorithm, and the heterogeneous earliest finish time (HEFT) algorithm [3]. The HEFT algorithm significantly outperforms the DLS, MH, LMT and CPOP algorithms in terms of average schedule length ratio [4,5]. The HEFT algorithm selects the task with the so-called highest upward rank value at each step, and assigns the selected task to the processor that minimizes its earliest finish time. The task's mean computation time on all processors and the mean communication rates on all links are used to compute the upward rank value.

Recently, Genetic Algorithms (GAs) have been widely reckoned as useful meta-heuristics for obtaining high quality solutions for a broad range of combinatorial optimization problems, which include task scheduling [6][7]. The GA operates on a number of solutions. Another merit of the GA is that its inherent parallelism can be exploited to further reduce its running time. However, standard GA algorithms for task scheduling are monolithic, as they attempt to scan the entire solution space. To enable the GA to search the solution space more effectively, CPGA was proposed. CPGA is based on the standard GA algorithm with some heuristic principles added to improve its performance [8].

Unfortunately, most of these algorithms cannot minimize the execution length and, at the same time, maximize the system's reliability. While this problem requires the simultaneous optimization of more than one non-commensurable and competing criterion, solutions to the multi-objective optimization problem are usually computed by combining them into a single criterion to be optimized. In this paper, a new modified GA, namely the HEFT-Non-dominated Sorting Genetic Algorithm (HEFT-NSGA), is proposed, which seeks to solve the multi-objective optimization problem in task scheduling.

This paper is organized as follows: Section 2 surveys the related work of our study. In Section 3, our HEFT-NSGA for task scheduling is presented. Experimental results are provided in Section 4, followed by the conclusion in Section 5.

II. RELATED WORK

A task scheduling system model consists of an application, a target computing environment, and performance criteria for scheduling.


A. Related definitions

An application is represented by a directed acyclic graph G = (V, E), where V is the set of tasks and E is the set of edges between the tasks. Each edge e(i, j) ∈ E represents the precedence constraint that task v_i should complete its execution before task v_j starts. Data is an m × m matrix of communication data (m is the number of tasks), where data_{i,j} is the amount of data required to be transmitted from task v_i to task v_j.

In a given task graph, a task without any parent is called an entry task, and a task without any child is called an exit task. In this paper, we assume that the task graph is a single-entry, single-exit task graph. If there is more than one entry (exit) task, they are connected to a zero-cost pseudo entry (exit) task with zero-cost edges, which does not affect the schedule.

We assume that the target computing environment consists of n heterogeneous processors p_1, p_2, ..., p_n connected in a fully connected topology. Let W be an m × n computation cost matrix in which each w_{i,j} gives the execution time to complete task v_i on processor p_j.

Each processor may fail due to a hardware fault, which results in task failure. These faults may be transient or permanent and are independent. Each independent fault results in the failure of only one processor.

Basic terminologies:

1) A feasible schedule S ensures that the precedence constraints between all tasks are met. A partial schedule is one which does not contain all tasks.
2) For a task v_i, st(v_i) and ft(v_i) are its scheduled start time and scheduled finish time, respectively. For a processor p_j, st(p_j) and ft(p_j) are the processor's start time (the time at which it starts running a task) and finish time (the time at which it completes a task), respectively.
3) For a task v_i, p_j denotes its scheduled processor.
4) For a communication e_{i,j}, comu(i, j) is the communication delay between tasks v_i and v_j. If tasks v_i and v_j are scheduled on different processors, then

comu(i, j) = data(i, j) · d(p_k, p_h)   (1)

where task v_i is mapped onto processor p_k, task v_j is mapped onto processor p_h, and d(p_k, p_h) is the time required to send a unit length of data from p_k to p_h.
5) Processor p_j's ready time R(p_j) is the time at which it becomes available to run a task.
6) For a task v_i, EST(i, j) and EFT(i, j) are the scheduled earliest start time and the scheduled earliest finish time of task v_i on processor p_j, respectively:

EST(i, j) = max{ R(p_j), max_{v_k ∈ parent(v_i)} [ ft(v_k) + comu(k, i) ] }   (2)-(3)

EFT(i, j) = EST(i, j) + w(i, j)   (4)

7) The data arrival time (DAT) of v_i at processor p_j is defined as:

DAT(i, j) = max_{v_k ∈ parent(v_i)} [ ft(v_k) + comu(i, k) ]   (5)

If tasks v_k and v_i are scheduled on the same processor, then comu(i, k) equals zero.
8) The parent task that maximizes the above expression is called the favored predecessor of v_i and is denoted by favored(v_i, p_j).
9) Let F(a) = <f1(a), f2(a), f3(a)> and F(b) = <f1(b), f2(b), f3(b)> be the vector values of the cost function F for solutions a and b, respectively. Then a dominates b if f_i(a) ≤ f_i(b) for all i (i = 1, 2, 3) and either f1(a) < f1(b) or f2(a) < f2(b) or f3(a) < f3(b).
10) Reliability probability of a processor: the reliability probability of processor p during a time interval t is e^{-λ_p t} [9, 10]. Under a task allocation S, the time required to execute all the tasks assigned to processor p is Σ_{i=1}^{N} X_{ip} · cost(i, p), where X_{ip} = 1 if task v_i is scheduled on processor p and X_{ip} = 0 otherwise. The corresponding processor reliability can then be formulated as formula (6):

PR_p(S) = exp( -λ_p · Σ_{i=1}^{N} X_{ip} · E_{ip} )   (6)

11) The reliability probability of path e_{pq} during a time interval t is e^{-λ_{pq} t} [9, 10]. Under a task allocation S, the time required for data communication between the terminal processors p and q is Σ_{i=1}^{N} Σ_{j≠i} X_{ip} · X_{jq} · ( data_{ij} · d(p, q) ); then, the corresponding path reliability can be given by formula (7):


PR_pq(S) = exp( -λ_{pq} · Σ_{i=1}^{N} Σ_{j≠i} X_{ip} · X_{jq} · ( data_{i,j} · d(p, q) ) )   (7)

12) The system's reliability probability with the task allocation S is computed as follows:

R(S) = Π_{p=1}^{P} PR_p(S) · Π_{p=1}^{P} Π_{q≠p} PR_pq(S) = e^{-count(S)}   (8)

count(S) = Σ_{i=1}^{N} Σ_{p=1}^{P} λ_p · X_{ip} · E_{ip} + Σ_{i=1}^{N} Σ_{j≠i} Σ_{p=1}^{P} Σ_{q≠p} λ_{pq} · X_{ip} · X_{jq} · ( C_{ij} / W_{pq} )   (9)

The first term of the function count(S) reflects the unreliability caused by the execution of tasks on processors of various reliabilities, and the second term reflects the unreliability caused by the inter-processor communication through different paths of various reliabilities. Maximizing the system reliability is equivalent to minimizing count(S).

B. The Heterogeneous-Earliest-Finish-Time (HEFT) Algorithm

The HEFT algorithm has two major phases: a task prioritizing phase for computing the priorities of all tasks, and a processor selection phase for selecting the tasks in the order of their priorities and scheduling each selected task on its best processor, i.e. the one which minimizes the task's finish time.

Task prioritizing phase: this phase requires the priority of each task to be set to the upward rank value rank_u, which is based on mean computation and mean communication costs. The task list is generated by sorting the tasks in decreasing order of rank_u. It can easily be shown that the decreasing order of rank_u values provides a topological order of the tasks, which is a linear order that preserves the precedence constraints.

Processor selection phase: the HEFT algorithm schedules the task on the processor on which the task has the earliest finish time.

The upward rank of a task v_i is recursively defined by formulas (11) and (12):

rank_u(n_i) = w̄_i + max_{n_j ∈ succ(n_i)} [ c̄_{i,j} + rank_u(n_j) ]   (11)

rank_u(n_exit) = w̄_exit   (12)

where w̄_i is the mean computation cost of task n_i and c̄_{i,j} is the mean communication cost of edge (n_i, n_j).

C. The Critical Path Genetic Algorithm (CPGA)

The CPGA algorithm is considered a hybrid of SGA principles and heuristic principles. The same principles and operators which are used in the Standard Genetic Algorithm (SGA) are used in the CPGA algorithm.

The SGA starts with an initial population of feasible solutions. Then, by applying some operators, the best solution can be obtained after some generations. The selection of the best solution is determined according to the value of the fitness function. The chromosome is divided into two sections: the mapping section and the scheduling section. The mapping section contains the processor indices on which the tasks are to be run. The scheduling section determines the sequence for processing the tasks. Figure 2 shows an example of such a representation of the chromosome.

Figure 1. A task graph and the computation times on different processors

Figure 2. Chromosome encoding

The encoding of the chromosome in CPGA is the same as in SGA but, in the initial population, the second part (the schedule) of the chromosome can be constructed using ALAP [16]. In CPGA, three modifications have been applied to the SGA to improve the scheduling performance. These modifications are:

Reuse of idle time: the idle time of a processor is used to assign some tasks to idle time slots.

Priority of the CPNs: according to this modification, the initial population is produced using the following steps. Initially, the entry task is the selected task and it is marked as a critical path task. An immediate successor (of the selected task) that has the highest priority value is then selected and it is also marked as a critical path task. This process is repeated until the exit node is reached. In each


generation of the population, the critical path tasks are scheduled as early as possible.

Load balance: the aim of the load balance modification is to obtain the minimum schedule length and, at the same time, satisfy the load balance.

III. HEFT-NON-DOMINATED SORTING GENETIC ALGORITHM (HEFT-NSGA)

Task scheduling is also a class of optimization problems. Existing scheduling algorithms handle task scheduling as a single-objective optimization, but in many practical applications multiple objectives need to be optimized at the same time.

Evolutionary algorithms have successfully been applied to the field of multi-objective optimization. In order to achieve a global search, an evolutionary algorithm maintains a population of potential solutions between generations; this population-to-population search is effective for finding the best solutions to multi-objective optimization problems.

In the case of multiple objectives, there may not be one solution which is best with respect to all objectives. In a typical multi-objective optimization problem, there exists a set of solutions which are superior to the rest of the solutions in the search space when all objectives are considered, but which are inferior to other solutions in one or more objectives. These solutions are known as Pareto-optimal solutions or non-dominated solutions; the rest of the solutions are known as dominated solutions. Since none of the solutions in the non-dominated set is absolutely better than any other, any one of them is an acceptable solution.

NSGA-II is by far one of the best evolutionary multi-objective optimization algorithms [11]. The developed HEFT-NSGA algorithm is considered a hybrid of NSGA principles and heuristic principles; the same principles and operators which are used in the NSGA algorithm are also used in the HEFT-NSGA algorithm. The encoding of the chromosome is the same as in SGA.

A. Generating the initial population

By performing the following steps, chromosomes are created:
1) Set st(m) (the processor's start processing time) = 0 and ft(m) (the processor's finish processing time) = 0;
2) Select a task v_i whose predecessors have all been scheduled;
3) Select the processor on which task v_i has the earliest finish time; if two or more processors have equal earliest finish times, select one of them at random;
4) If task v_i is scheduled on processor p_k, set ft(p_k) = ft(v_i);
5) Repeat steps 2 to 4 until all tasks are scheduled and new chromosomes are generated;
6) Check whether the tasks in the chromosomes satisfy the logical requirements;
7) Repeat steps 1 to 6 for the size of the initial population.

B. HEFT-NSGA algorithm

The initialized population is sorted by non-domination into fronts: the first front is the completely non-dominated set in the current population, the second front is dominated only by the individuals in the first front, and so on. Each individual in each front is assigned a rank (fitness) value based on the front to which it belongs: individuals in the first front are given a fitness value of 1, individuals in the second front are assigned a fitness value of 2, and so on.

In addition to the fitness value, a new parameter called the crowding distance is calculated for each individual. The crowding distance is a measure of how close an individual is to its neighbors. A large average crowding distance results in better diversity in the population.

Parents are selected from the population by binary tournament selection based on rank and crowding distance: an individual is selected if either its rank is lower than the other's or its crowding distance is greater than the other's. The selected population generates offspring using the crossover and mutation operators.

The current parent population and the current offspring are sorted again based on non-domination and only the best N individuals are selected, where N is the population size. The selection is based on rank and on the crowding distance in the last front.

Step 1: Initialize the parameters and encode the chromosome;
Step 2: Generate the initial population;
Step 3: Non-dominated sort: the initialized population is sorted based on non-domination;
Step 4: While the stop criterion is not satisfied, do begin
  4.1) P_new <- P_current
  4.2) Repeat (N_P / 2) times:
       P_Dad <- select(P_new);
       P_Mom <- select(P_new);
       P_new <- crossover(P_Dad, P_Mom);
       End repeat;
  4.3) For each chromosome in P_new do begin
       Mutate(chromosome);
       End for;
  4.4) Non-dominated sort (P_current, P_new).
Step 5: Return the best N_p chromosomes.

The fast non-dominated sort algorithm is described below.
For each individual p in population P, do the following:
  Initialize S_p = ∅ (this set will contain all individuals that are dominated by p).
  Initialize n_p = 0 (this is the number of individuals that dominate p).


For each individual q in P:
  If p dominates q, then add q to the set S_p, i.e. S_p = S_p ∪ {q};
  Else if q dominates p, then increase the domination counter of p, i.e. n_p = n_p + 1.
If n_p = 0, i.e. no individual dominates p, then p belongs to the first front; set the rank of individual p to one, i.e. p_rank = 1, and update the first front set by adding p to front one, i.e. F_1 = F_1 ∪ {p}.
This is carried out for all the individuals in the main population P.
Initialize the front counter i to one.
While the i-th front is non-empty, i.e. F_i ≠ ∅:
  Set Q = ∅ (the set for storing the individuals of the (i+1)-th front).
  For each individual p in front F_i and for each individual q in S_p (S_p is the set of individuals dominated by p), set n_q = n_q - 1 (decrease the domination count of individual q).
    If n_q = 0, then none of the individuals in the subsequent fronts dominates q; hence, set q_rank = i + 1 and update the set Q with individual q, i.e. Q = Q ∪ {q}.
  Increase the front counter by one.
  Now the set Q is the next front, hence F_i = Q.

Crowding distance: once the non-dominated sort is complete, a crowding distance is assigned to each individual; the individuals are selected based on their ranks, and all individuals in the population are assigned a crowding distance value. Comparing the crowding distance between two individuals in different fronts is meaningless. The crowding distance is calculated as follows.
For each front F_i (n is the number of individuals in the front):
  Initialize the distance to zero for all individuals, i.e. F_i(d_j) = 0, where j corresponds to the j-th individual in front F_i.
  For each objective function m, sort the individuals in front F_i based on objective m, i.e. I = sort(F_i, m). Assign an infinite distance to the boundary individuals in F_i, i.e. I(d_1) = ∞ and I(d_n) = ∞.
  For k = 2 to (n - 1):

I(d_k) = I(d_k) + ( I(k+1).m - I(k-1).m ) / ( f_m^max - f_m^min )   (13)

where I(k).m is the value of the m-th objective function of the k-th individual in I. (A compact Python illustration of the non-dominated sort and the crowding distance is given at the end of this subsection.)

The basic idea behind the crowding distance is to find the Euclidean distance between the individuals in a front based on their m objectives in the m-dimensional hyperspace. Individuals on the boundary are always selected since they have an infinite distance assignment.

Selection: once the individuals are sorted based on non-domination and have crowding distances assigned, the selection is carried out using a crowded-comparison operator (<_n). The comparison is based on:
1) the non-domination rank p_rank, i.e. individuals in front F_i have rank p_rank = i;
2) the crowding distance F_i(d_j): p <_n q if p_rank < q_rank, or if p and q belong to the same front F_i and F_i(d_p) > F_i(d_q).
The individuals are selected by binary tournament selection with the crowded-comparison operator.

Since the standard genetic algorithm may require some time to find an ideal result, it is necessary to modify some principles. In HEFT-NSGA, once a task is selected to be scheduled on a processor, some steps are altered; the pseudo code of this part of the algorithm is as follows:
1) RT[P_j] = 0, where RT is the ready time of the processors;
2) Let LT be a list of tasks in the topological order of the DAG;
3) For i = 1 to m (m is the number of tasks in the DAG):
   a) Remove the first task t_i from list LT;
   b) For j = 1 to n (n is the number of processors):
      If P_j can make task v_i complete as early as possible, schedule v_i on p_j:
         ST[v_i] = max{ RT[p_j], DAT(i, j) }
         FT[v_i] = ST[v_i] + w_{i,j}
         RT[p_j] = FT[v_i]
      End If
   End For
End For
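To make subsection B concrete, the following is a small, self-contained Python sketch of the fast non-dominated sort and the crowding-distance assignment described above. It is only an illustration of the two procedures under the stated definitions (all objectives are treated as values to be minimized, so reliability is negated in the toy data); it is not the authors' implementation.

from math import inf

def dominates(a, b):
    """a dominates b for minimization: no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """objs: list of objective vectors. Returns a list of fronts (lists of indices)."""
    S = [[] for _ in objs]          # S[p]: individuals dominated by p
    n = [0] * len(objs)             # n[p]: number of individuals dominating p
    fronts = [[]]
    for p in range(len(objs)):
        for q in range(len(objs)):
            if dominates(objs[p], objs[q]):
                S[p].append(q)
            elif dominates(objs[q], objs[p]):
                n[p] += 1
        if n[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in S[p]:
                n[q] -= 1
                if n[q] == 0:
                    nxt.append(q)
        i += 1
        fronts.append(nxt)
    return fronts[:-1]              # drop the trailing empty front

def crowding_distance(front, objs):
    """Crowding distance of each individual in one front, per formula (13)."""
    d = {p: 0.0 for p in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda p: objs[p][m])
        d[ordered[0]] = d[ordered[-1]] = inf
        fmin, fmax = objs[ordered[0]][m], objs[ordered[-1]][m]
        if fmax == fmin:
            continue
        for k in range(1, len(ordered) - 1):
            d[ordered[k]] += (objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]) / (fmax - fmin)
    return d

# Toy population: objectives are (makespan, -reliability), both to be minimized.
objs = [(10.0, -0.90), (12.0, -0.95), (11.0, -0.85), (15.0, -0.80)]
fronts = fast_non_dominated_sort(objs)
print(fronts)                                # [[0, 1], [2], [3]]
print(crowding_distance(fronts[0], objs))    # boundary individuals get inf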


IV. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we compare the performance of the HEFT-NSGA algorithm with two well-known scheduling algorithms for heterogeneous distributed systems: the HEFT and CPGA algorithms. We consider two sets of graphs as the basis for testing the algorithms: randomly generated application graphs and graphs that represent some numerical real-world problems.

A. Comparison metrics

Comparisons of the algorithms are based on the following three metrics. A small worked illustration of the makespan and SLR metrics on a toy graph is given at the end of this section.

Makespan, or scheduling length, is defined as makespan = EFT(v_exit), where EFT(v_exit) is the earliest finish time of the scheduled exit task.

Schedule Length Ratio (SLR): the main performance measure of a scheduling algorithm on a DAG is the schedule length (makespan) of its output schedule. Since a large set of application graphs with different properties is used, it is necessary to normalize the schedule length to a lower bound, which is called the Schedule Length Ratio (SLR). The SLR is defined as

SLR = makespan / Σ_{v_i ∈ CP_min} min{ cost(v_i) }   (14)

where CP_min is the critical path of the DAG when the task node weights are evaluated as the minimum computation cost among the eligible processors.

Reliability probability: the application reliability probability can be evaluated by the reliability of the exit task, and is defined as Reliability probability = p[v_exit].

B. Randomly generated application graphs

In this study, we first considered randomly generated application graphs: a random graph generator was implemented to generate weighted application DAGs with various characteristics that depend on several input parameters. The simulation-based framework allows sets of values to be assigned to the parameters used in the random graph generator. For the generation of random graphs, which are commonly used to compare scheduling algorithms [4, 5, 12, 13], the following fundamental characteristics of the DAG are considered:
- DAG size m: the number of tasks in the application DAG.
- Communication to computation cost ratio (CCR): the ratio of the average communication cost to the average computation cost.
- Computational cost heterogeneity factor h: a higher h value indicates a higher variance of the computation cost of a task with respect to the processors in the system, and vice versa [5].
In all the experiments, only graphs with a single entry node and a single exit node were considered, and the input parameters were restricted to the following values:
v ∈ {20, 40, 60, 80, 100, 120}
h ∈ {0.5, 1.0, 2.0}
CCR ∈ {0.5, 1.0, 1.5, 2.0}

C. Random application performance results

The goal of these experiments is to compare the proposed HEFT-NSGA algorithm with the other two algorithms, HEFT and CPGA. The performance of the algorithms was compared with respect to various graph characteristics. The first set of experiments compares the performance of the algorithms with respect to various CCR values and graph sizes. The results are shown in Fig. 3-5. According to the results, when CCR < 1 the SLR-based ranking of the algorithms is {HEFT-NSGA, CPGA, HEFT}, and when CCR > 1 the SLR-based ranking is {HEFT-NSGA, HEFT, CPGA}. We also observe from the results that HEFT-NSGA outperforms the CPGA and HEFT algorithms in terms of makespan, SLR, and reliability probability. Through the multi-objective genetic algorithm, HEFT-NSGA tries to find solutions with which the scheduled system reaches a better balance between the objectives than the other algorithms. We can also observe from Fig. 4 and Fig. 5 that, for an individual indicator, the improvement of HEFT-NSGA is not always large, but taken together all indicators are better than those of the other algorithms.

D. Application graphs of real-world problems

Using real applications to test the performance of algorithms is very common [4, 5, 13, 14, 15, 16]. Hence, in addition to randomly generated DAGs, we also simulated two real-world problems: Gaussian elimination [4, 5, 13, 16] and Fast Fourier Transformation (FFT) [5, 14].

For the experiments with the Gaussian elimination application, the same CCR and range percentage values (given above) were used. Since the structure of the application graph is known, we do not need the other parameters. A new parameter, the matrix size l, is used in place of m (the number of tasks in the graph). The total number of tasks in a Gaussian elimination graph is equal to (l^2 + l - 2) / 2 [5]. For the comparison of SLR and reliability probability, the matrix size used in the experiments is varied from 6 to 18 with an increment step of 2, and the number of processors is set to 4. The average SLR and reliability probability produced by each scheduling algorithm in relation to the matrix size are shown in Fig. 6. From Fig. 6 we also observe that HEFT-NSGA outperforms the CPGA and HEFT algorithms significantly.

The FFT algorithm consists of two parts: the recursive calls and the butterfly operation. The task graph can accordingly be divided into recursive call tasks and butterfly operation tasks; for an input vector of a given size, there are corresponding numbers of recursive call tasks and butterfly operation tasks. Each path from the start task to any of the exit tasks in an FFT task graph is a critical path, since the computation costs of the tasks in any level are equal and the communication costs of all edges between two consecutive levels are equal [5]. For the FFT-related experiments, only the CCR and range percentage parameters, among the parameters given above, were used, as in the Gaussian elimination application. According to the CCR values we want, we generate DAGs with different numbers of tasks. Fig. 7 demonstrates that HEFT-NSGA outperforms the CPGA and HEFT algorithms in terms of makespan, SLR, and reliability probability.
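The following is a compact, self-contained Python illustration of the quantities used above: it builds a HEFT-style schedule for a four-task toy DAG (greedy earliest-finish-time placement in decreasing upward-rank order, in the spirit of formulas (2)-(4), (11) and (12)) and then reports the makespan and the SLR of formula (14). The DAG, the cost matrix, and the simplified mean-communication model are invented for the example; this is a sketch of the metrics, not the authors' experimental code.

# Toy DAG: edges give the communication data volume between tasks.
succ = {0: {1: 2.0, 2: 1.0}, 1: {3: 3.0}, 2: {3: 1.0}, 3: {}}
pred = {0: {}, 1: {0: 2.0}, 2: {0: 1.0}, 3: {1: 3.0, 2: 1.0}}
w = {0: [4.0, 6.0], 1: [3.0, 5.0], 2: [5.0, 4.0], 3: [6.0, 5.0]}  # cost on p0, p1
n_proc = 2
unit_delay = 1.0            # time per unit of data between two different processors

def mean(xs):
    return sum(xs) / len(xs)

def upward_rank():
    """rank_u in the spirit of (11)-(12), using mean computation/communication costs."""
    rank = {}
    for t in sorted(succ, reverse=True):          # children have larger ids in this toy DAG
        tail = max((mean([0.0, unit_delay]) * c + rank[s] for s, c in succ[t].items()),
                   default=0.0)
        rank[t] = mean(w[t]) + tail
    return rank

def heft_schedule():
    """Greedy HEFT-style schedule: highest rank first, processor with earliest EFT."""
    order = sorted(succ, key=lambda t: upward_rank()[t], reverse=True)
    ready = [0.0] * n_proc                         # R(p_j)
    where, finish = {}, {}
    for t in order:
        best = None
        for p in range(n_proc):
            est = ready[p]
            for par, data in pred[t].items():      # data arrival times, as in formula (2)
                delay = 0.0 if where[par] == p else data * unit_delay
                est = max(est, finish[par] + delay)
            eft = est + w[t][p]                    # formula (4)
            if best is None or eft < best[0]:
                best = (eft, p)
        finish[t], where[t] = best
        ready[best[1]] = best[0]
    return finish, where

def slr(makespan):
    """Formula (14): makespan over the minimum-cost critical path."""
    cp = {}
    for t in sorted(succ, reverse=True):
        cp[t] = min(w[t]) + max((cp[s] for s in succ[t]), default=0.0)
    return makespan / cp[0]

finish, where = heft_schedule()
makespan = finish[3]                               # exit task
print("schedule:", where, "makespan:", makespan, "SLR:", round(slr(makespan), 3))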


V. CONCLUSION

With the development of parallel computing, distributed system applications have greatly expanded. In some applications, the scheduling system does not only require the fastest task schedule: the scheduling result is also required to maximize the system's reliability probability at the same time. Existing algorithms often do not take both the earliest completion time and the system's reliability into account, so the scheduling results obtained by these algorithms may be outstanding in one aspect but not ideal in others. In this paper, a multi-objective evolutionary algorithm and a heuristic algorithm are combined so that multiple objectives are optimized simultaneously. The performance of HEFT-NSGA is compared to two existing scheduling algorithms, HEFT and CPGA. The comparison is based on both randomly generated application DAGs and two real-world problems, Gaussian elimination and fast Fourier transformation. The simulation results show that the HEFT-NSGA algorithm outperforms both the HEFT and CPGA algorithms in terms of scheduling length (makespan), scheduling length ratio (SLR), and reliability probability.

REFERENCES

[1] H. El-Rewini, T.G. Lewis, "Scheduling parallel program tasks onto arbitrary target machines", J. Parallel Distrib. Comput. 9 (2) (1990) 138-153.
[2] M. Iverson, F. Ozguner, G. Follen, "Parallelizing existing applications in a distributed heterogeneous environment", in: Proceedings of the Heterogeneous Computing Workshop, 1995, pp. 93-100.
[3] P.Y.R. Ma, E.Y.S. Lee, M. Tsuchiya, "A task allocation model for distributed computing systems", IEEE Trans. Comput. 31 (1) (1982) 41-47.
[4] G.Q. Liu, K.L. Poh, M. Xie, "Iterative list scheduling for heterogeneous computing", J. Parallel Distrib. Comput. 65 (5) (2005) 654-665.
[5] H. Topcuoglu, S. Hariri, M.-Y. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing", IEEE Trans. Parallel Distrib. Syst. 13 (3) (2002) 260-274.
[6] A.S. Wu, H. Yu, S. Jin, K.-C. Lin, G. Schiavone, "An Incremental Genetic Algorithm Approach to Multiprocessor Scheduling", IEEE Trans. Parallel and Distributed Systems 15 (2004) 824-834.
[7] Y. Kwok, I. Ahmad, "Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors", ACM Computing Surveys 31 (1999) 406-471.
[8] Fatma A. Omara, Mona M. Arafa, "Genetic algorithms for task scheduling problem", Journal of Parallel and Distributed Computing 70 (2010) 13-22.
[9] G. Attiya, Y. Hamam, "Task allocation for maximizing reliability of distributed systems: a simulated annealing approach", Journal of Parallel and Distributed Computing 66 (2006) 1259-1266.
[10] S.M. Shatz, J.P. Wang, M. Goto, "Task allocation for maximizing reliability of distributed computer systems", IEEE Transactions on Computers 41 (1992) 1156-1168.
[11] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, "A fast and elitist multi-objective genetic algorithm: NSGA-II", IEEE Trans. on Evolutionary Computation 6 (2) (2002) 182-197.
[12] A. Dogan, F. Ozguner, "Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing", IEEE Trans. Parallel Distrib. Syst. 13 (3) (2002) 308-323.
[13] Mohammad I. Daoud, Nawwaf Kharma, "A high performance algorithm for static task scheduling in heterogeneous distributed computing systems", J. Parallel Distrib. Comput. 68 (4) (2008) 399-409.
[14] Y. Chung, S. Ranka, "Application and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors", in: Proc. Supercomputing, 1992, pp. 512-521.
[15] C.M. Woodside, G.G. Monforton, "Fast allocation of processes in distributed and parallel systems", IEEE Trans. Parallel Distrib. Syst. 4 (2) (1993) 164-174.
[16] M. Wu, D. Gajski, "Hypertool: a programming aid for message passing systems", IEEE Trans. Parallel Distrib. Syst. 1 (3) (1990) 330-343.

Yuanlong Chen was born in 1981 and received his M.S. degree in 2007. He now works in the School of Computer Science and Technology, Harbin Institute of Technology. He is a doctoral candidate and is engaged mainly in systems engineering and parallel computing.

Figure 3. Makespan, SLR and Reliability probability of HEFT, CPGA and HEFT-NSGA


Figure 4. Makespan, SLR and Reliability probability of HEFT, CPGA and HEFT-NSGA for CCR=0.4

Figure 5. Makespan, SLR and Reliability probability of HEFT, CPGA and HEFT-NSGA for CCR=1.2

Figure 6. Makespan, SLR and Reliability probability of CPGA, HEFT and HEFT-NSGA for Gaussian elimination

Figure 7. Makespan, SLR and Reliability probability of CPGA, HEFT and HEFT-NSGA for FFT


Dominance-based Rough Interval-valued Fuzzy Set in Incomplete Fuzzy Information System
Minlun Yan
Department of Mathematics and Applied Mathematics, Liangyungang Teachers College, Liangyungang, 222006, P.R. China
Email: yanminlun@163.com

Abstract: The fuzzy rough set is a fuzzy generalization of the classical rough set. In the traditional fuzzy rough model, the set to be approximated is a fuzzy set. This paper deals with an incomplete fuzzy information system with interval-valued decision by generalizing the rough approximation of a fuzzy set to the rough approximation of an interval-valued fuzzy set. Since all condition attributes are considered as criteria in such an incomplete fuzzy information system, the interval-valued fuzzy set is approximated by using information granules which are constructed on the basis of a dominance relation. By the proposed rough approximation, the "at least" and "at most" decision rules can be generated from the incomplete fuzzy information system with interval-valued decision. To obtain the optimal "at least" and "at most" decision rules, the concepts of the lower and upper approximate reducts, and of the relative lower and upper approximate reducts for an object, are proposed in the incomplete fuzzy information system with interval-valued decision. The judgement theorems and discernibility matrices associated with these reducts are also obtained. Some numerical examples are employed to substantiate the conceptual arguments.

Index Terms: incomplete fuzzy information system, interval-valued fuzzy set, dominance relation, rough set theory, knowledge reduction, decision rule

I. INTRODUCTION

Rough set theory [27]-[33], after a rocky start in the last stage of the twentieth century, has received more and more attention from many researchers all over the world, both in theoretical investigations and in practical applications. In recent years, rough set theory has been demonstrated to be useful in many fields such as Artificial Intelligence, Automatic Knowledge Acquisition, Data Mining, Pattern Recognition and so on.

In the traditional rough set model, the lower and upper approximations were introduced with reference to an indiscernibility relation [27] (reflexive, symmetric, transitive), which is assumed to be an equivalence relation. Such approximations can only be used to deal with information systems in which the attribute values are assumed to be nominal data, i.e. symbols. In many practical applications, however, the situations may be more complex because of complicated or mixed data. Therefore, expanding the classical rough set model to complex information systems has become a necessity.

Presently, many generalizations of the rough set model have been proposed for different types of information systems. For example, by considering the unknown values in the information system (i.e. an incomplete information system), many researchers have proposed different types of binary relations (similarity relation [36]-[38], tolerance relation [20], [23], limited tolerance relation [40] and so on) for classification purposes and for constructing the rough approximations [11], [16]-[21], [23], [24], [34], [37], [38], [40], [43], [44]. By considering linguistic terms (i.e. fuzzy sets) for the attribute values, the rough set model can also be generalized to different fuzzy environments, i.e. fuzzy rough approaches [4]-[6], [10], [15], [26], [39], [42], [46]. Moreover, since the original rough set approach is not able to discover inconsistencies coming from the consideration of criteria, that is, attributes with preference-ordered domains (scales), such as product quality, market share, and debt ratio, Greco et al. have proposed an extension of the Classical Rough Sets Approach, which is called the Dominance-based Rough Sets Approach (DRSA) [1]-[3], [7], [8], [11]-[14], [34], [45]. This innovation is mainly based on substituting a dominance relation for the indiscernibility relation. Greco et al. also generalized the DRSA to the fuzzy environment in Ref. [10].

In the incomplete information system, the set to be approximated is a crisp subset of the universe, which is induced from the partition determined by the decision attributes (decision class). In the DRSA, the sets to be approximated are upward and downward unions of the decision classes. On the other hand, in the fuzzy rough model, the set to be approximated tends to be a fuzzy set on the universe of discourse. Moreover, it should be noticed that, by generalizing the fuzzy rough approach, Ref. [9] proposed an extension of the fuzzy rough set model which is used to approximate an interval-valued fuzzy set. However, such approximations of the interval-valued fuzzy set are only constructed in Pawlak's approximation space (the indiscernibility relation is used for classification purposes).

From the discussion above, the purpose of this paper is to investigate a complex information system which is called the incomplete fuzzy information system with interval-valued decision.

This work is supported by the Natural Science Foundation of China (No. 61100116), the Natural Science Foundation of Jiangsu Province of China (No. BK2011492), and the Natural Science Foundation of Jiangsu Higher Education Institutions of China (No. 11KJB520004).


Such a system has the following four characteristics:
- It is a fuzzy system, because it formulates a problem with fuzzy samples (samples containing fuzzy representations);
- It is an incomplete information system, because some objects have unknown values on some of the condition attributes;
- All condition attributes in the incomplete fuzzy information system with interval-valued decision are considered as criteria;
- The set to be approximated in the incomplete fuzzy information system with interval-valued decision is an interval-valued fuzzy set.

Obviously, the incomplete fuzzy information system with interval-valued decision is a generalization of the incomplete and fuzzy information systems. By assuming that the unknown values in such a system are just missing, although they do exist [18], [19], an expanded dominance relation is used for classifying objects. The lower and upper approximations of the interval-valued fuzzy set are then presented, which are generalizations of the dominance-based fuzzy rough set proposed by Greco in Ref. [10]. By the lower approximation of the interval-valued fuzzy set, one can induce the "at least" decision rules, while by using the upper approximation of the interval-valued fuzzy set, the "at most" decision rules hidden in the information system can be unravelled.

Since knowledge reduction is one of the central problems in rough set theory, based on the proposed rough approximations we further propose four types of knowledge reductions: the lower (upper) approximate reducts, and the relative lower (upper) approximate reducts for an object in the universe. The lower (upper) approximate reducts are minimal subsets of the condition attributes which preserve the lower (upper) approximations of the interval-valued fuzzy set. The relative lower (upper) approximate reducts for an object in the universe are minimal subsets of the condition attributes which preserve the membership values of the lower (upper) approximations of the interval-valued fuzzy set for that object. Thus, by the relative lower (upper) approximate reducts for an object in the universe, one can obtain the optimal "at least" ("at most") decision rules supported by that object.

To facilitate our discussion, we first present the concepts of the fuzzy information system and the dominance-based fuzzy rough set in Section 2. We then propose the rough approximations in the incomplete fuzzy information system with interval-valued decision in Section 3. The concepts of the lower and upper approximate reducts, and of the relative lower and upper approximate reducts for an object, are laid out in Section 4. We also present practical approaches to compute these four types of reducts. We then summarize our paper in Section 5.

II. DOMINANCE-BASED ROUGH SET MODEL IN FUZZY INFORMATION SYSTEM

Definition 1: A fuzzy set F̃ defined on a universe U may be given as

F̃ = { <x, μ_F̃(x)> : x ∈ U }   (1)

where μ_F̃ : U → [0, 1] is the membership function of F̃. The membership value μ_F̃(x) describes the degree of belongingness of x ∈ U in F̃.

A fuzzy information system represents the formulation of a problem with fuzzy samples (samples containing fuzzy representations). A fuzzy information system can be denoted by a pair I = <U, AT>, where U is a non-empty finite set of objects, called the universe, and AT is a non-empty finite set of attributes.

A fuzzy decision table is a fuzzy information system D = <U, AT ∪ {d}>, where d ∉ AT; d is an attribute called the decision, and AT is termed the condition attributes set.

In a fuzzy decision table D, if A ⊆ AT and A = {a_1, ..., a_m} is the set of condition attributes and d is the decision attribute, then we consider a universe of discourse U and m + 1 fuzzy sets, denoted by ã_1, ..., ã_m and d̃, defined on U by means of the membership functions μ_ãi : U → [0, 1], i ∈ {1, ..., m}, and μ_d̃ : U → [0, 1]. μ_ãi(x) and μ_d̃(x) are used to represent the values of the object x with respect to the condition attribute a_i and the decision attribute d respectively.

Suppose that we want to approximate the knowledge contained in d̃ by using the knowledge about {ã_1, ..., ã_m}. Then the lower approximation of the fuzzy set d̃, given the information on ã_1, ..., ã_m, is a fuzzy set App(ã_1, ..., ã_m, d̃) whose membership value for each x ∈ U, denoted by μ_App(ã_1,...,ã_m,d̃)(x), is defined as [10]:

μ_App(ã_1,...,ã_m,d̃)(x) = inf_{z ∈ D_A^+(x)} { μ_d̃(z) }   (2)

where for each x ∈ U, D_A^+(x) is a non-empty set such that

D_A^+(x) = D^+_{ã_1,...,ã_m}(x) = { y ∈ U : μ_ãi(y) ≥ μ_ãi(x), ∀ i ∈ {1, ..., m} };

D_A^+(x) is the set of objects dominating x in terms of the set of condition attributes A.

The lower approximation membership value μ_App(ã_1,...,ã_m,d̃)(x) can be interpreted as an "at least" decision rule:

μ_ã1(y) ≥ μ_ã1(x) ∧ μ_ã2(y) ≥ μ_ã2(x) ∧ ... ∧ μ_ãm(y) ≥ μ_ãm(x)  ⟹  μ_d̃(y) ≥ μ_App(ã_1,...,ã_m,d̃)(x).
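The next short Python fragment is an illustrative rendering of formula (2): for each object it collects the dominating set D_A^+(x) and takes the infimum of the decision membership over it. The three-object table is invented for the example and is not data from the paper.

def dominating_set(x, table, attrs):
    """D_A^+(x): objects whose membership is >= that of x on every condition attribute."""
    return [y for y in table if all(table[y][a] >= table[x][a] for a in attrs)]

def lower_approx(table, attrs, dec):
    """Formula (2): inf of the decision membership over the dominating set of each object."""
    return {x: min(dec[y] for y in dominating_set(x, table, attrs)) for x in table}

# Tiny invented fuzzy decision table: three objects, two condition attributes.
mu = {"x1": {"a1": 0.2, "a2": 0.5},
      "x2": {"a1": 0.6, "a2": 0.7},
      "x3": {"a1": 0.9, "a2": 0.4}}
mu_d = {"x1": 0.3, "x2": 0.8, "x3": 0.6}

print(lower_approx(mu, ["a1", "a2"], mu_d))
# x1 is dominated by itself and x2, so its lower-approximation value is min(0.3, 0.8) = 0.3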



Similarly, the upper approximation of d̃, given the information on ã_1, ..., ã_m, is a fuzzy set App̄(ã_1, ..., ã_m, d̃) whose membership value for each x ∈ U, denoted by μ_App̄(ã_1,...,ã_m,d̃)(x), is defined as [10]:

μ_App̄(ã_1,...,ã_m,d̃)(x) = sup_{z ∈ D_A^-(x)} { μ_d̃(z) }   (3)

where for each x ∈ U, D_A^-(x) is a non-empty set such that

D_A^-(x) = D^-_{ã_1,...,ã_m}(x) = { y ∈ U : μ_ãi(y) ≤ μ_ãi(x), ∀ i ∈ {1, ..., m} };

D_A^-(x) is the set of objects dominated by x in terms of the set of condition attributes A.

The upper approximation membership value μ_App̄(ã_1,...,ã_m,d̃)(x) can be interpreted as an "at most" decision rule:

μ_ã1(y) ≤ μ_ã1(x) ∧ μ_ã2(y) ≤ μ_ã2(x) ∧ ... ∧ μ_ãm(y) ≤ μ_ãm(x)  ⟹  μ_d̃(y) ≤ μ_App̄(ã_1,...,ã_m,d̃)(x).

The pair [App(ã_1, ..., ã_m, d̃), App̄(ã_1, ..., ã_m, d̃)] is referred to as a rough set of the fuzzy set d̃ by using the knowledge about {ã_1, ..., ã_m} in terms of the dominance principle. For more details about the properties of [App(ã_1, ..., ã_m, d̃), App̄(ã_1, ..., ã_m, d̃)], we refer the readers to Ref. [10].

III. DOMINANCE-BASED ROUGH SET APPROACH TO THE INCOMPLETE FUZZY INFORMATION SYSTEM WITH INTERVAL-VALUED DECISION

A. Rough Approximation of an Interval-valued Fuzzy Set

In this section, what will be discussed is a complex decision table which is called the Incomplete Fuzzy Information System with Interval-valued Decision (IFISID). Such a decision table is still denoted, without confusion, by D = <U, AT ∪ {d}>. However, it should be noticed that the incomplete fuzzy information system with interval-valued decision is different from the traditional fuzzy decision table for the following reasons:
- Precise values of some objects on the fuzzy attributes are not known, i.e. there are unknown values. In this paper, the special symbol * is used to express an unknown value. Moreover, we assume here that an unknown value is just missing, but it does exist. By this interpretation, the unknown value * is considered to be comparable with any real value in the domain of the corresponding attribute.
- The set to be approximated in the IFISID is not a fuzzy set, but an interval-valued fuzzy set. The membership function of such an interval-valued fuzzy set is μ_[d̃] : U → I[0, 1], where I[0, 1] is the set of all closed subintervals of the interval [0, 1].

Since unknown values exist, the traditional dominance relation should be generalized.

Definition 2: [34] Let D be an IFISID and A = {a_1, ..., a_m} ⊆ AT; the dominance relation in terms of A is defined as:

D(A) = { (x, y) ∈ U^2 : μ_ãi(x) ≥ μ_ãi(y) ∨ μ_ãi(x) = * ∨ μ_ãi(y) = * }   (4)

where i ∈ {1, ..., m}.

Different from the traditional dominance relation proposed by Greco in Ref. [12], the dominance relation D(A) is reflexive but, in general, does not need to be symmetric or transitive. Thus, D(A) is a binary relation which satisfies

D(A) = ∩_{a_i ∈ A} D({a_i}), i ∈ {1, ..., m};   (5)

A_1 ⊆ A_2  ⟹  D(A_1) ⊇ D(A_2).   (6)

By D(A), one can define the following two sets for each x ∈ U:
- the set of objects that may dominate x in terms of the set of condition attributes A, i.e.

D_A^+(x) = D^+_{ã_1,...,ã_m}(x) = { y ∈ U : (y, x) ∈ D(A) };   (7)

- the set of objects that may be dominated by x in terms of the set of condition attributes A, i.e.

D_A^-(x) = D^-_{ã_1,...,ã_m}(x) = { y ∈ U : (x, y) ∈ D(A) }.   (8)

Since, by the decision attribute d, the set to be approximated is an interval-valued fuzzy set [d̃], for each x ∈ U let us denote by μ_[d̃]^-(x) and μ_[d̃]^+(x) the lower and upper limits of the object x with respect to the decision attribute d, with the condition μ_[d̃]^-(x) ≤ μ_[d̃]^+(x). Moreover, for all x, y ∈ U, let us denote by:
- μ_[d̃](y) ≥ μ_[d̃](x) the case in which μ_[d̃]^-(y) ≥ μ_[d̃]^-(x) and μ_[d̃]^+(y) ≥ μ_[d̃]^+(x);
- μ_[d̃](y) ≤ μ_[d̃](x) the case in which μ_[d̃]^-(y) ≤ μ_[d̃]^-(x) and μ_[d̃]^+(y) ≤ μ_[d̃]^+(x);
- μ_[d̃](y) < μ_[d̃](x) the case in which μ_[d̃](y) ≤ μ_[d̃](x) and μ_[d̃](y) ≠ μ_[d̃](x).

The complement of μ_[d̃](x) = [ μ_[d̃]^-(x), μ_[d̃]^+(x) ] is denoted by μ_[d̃]^C(x), where μ_[d̃]^C(x) = [ 1 - μ_[d̃]^+(x), 1 - μ_[d̃]^-(x) ].

Similar to fuzzy set theory [47], the operators ⊆, ∩, ∪ of the interval-valued fuzzy sets are defined as follows. Suppose that [d̃_1] and [d̃_2] are two different interval-valued fuzzy sets induced by two different decisions d_1 and d_2; then:
- [d̃_1] ⊆ [d̃_2]  ⟺  μ_[d̃1](x) ≤ μ_[d̃2](x) for each x ∈ U;
- μ_[d̃1]∩[d̃2](x) = [ min{ μ_[d̃1]^-(x), μ_[d̃2]^-(x) }, min{ μ_[d̃1]^+(x), μ_[d̃2]^+(x) } ];
- μ_[d̃1]∪[d̃2](x) = [ max{ μ_[d̃1]^-(x), μ_[d̃2]^-(x) }, max{ μ_[d̃1]^+(x), μ_[d̃2]^+(x) } ].
2012 ACADEMY PUBLISHER


1378 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

TABLE I.
for each x U , denoted by [App(f
a1 ,,af f (x), A N EXAMPLE OF INCOMPLETE FUZZY INFORMATION SYSTEM WITH
m ,[d])]
where INTERVAL - VALUED DECISION .

[App(f U a1 a2 a3 a4 d
a1 ,,af f (x)
m ,[d])]
[ ] x1 0.9 * 0.2 0.7 [0.5, 0.7]
= f (x), + f (x) x2 0.9 0.2 0.2 0.1 [0.8, 1.0]
[App(f a1 ,,af
m ,[d])] a1 ,,af
[App(f m ,[d])]
[
] x3 0.1 0.1 0.1 0.9 [0.0, 0.3]
= infzD (x) { f (z)}, infzD (x) {+ f (z)} , (9) x4 0.0 0.9 * 0.8 [0.2, 0.5]
A [d] A [d]
x5 0.1 0.1 1.0 0.8 [0.4, 0.7]
the upper approximation of the interval-valued fuzzy
f given the information on ae1 , , af
set [d] m is an
x6 * 0.2 0.9 0.1 [0.3, 0.6]
x7 0.0 0.1 0.9 0.2 [0.0, 0.2]
interval-valued fuzzy set [App(ae1 , , af
m , f
[d])], whose x8 0.9 0.9 0.1 1.0 [0.6, 0.9]
membership value for each x U , denoted by x9 0.8 0.4 1.0 1.0 [0.9, 1.0]
[App(f f (x), where
a ,,af ,[d])] x10 0.0 1.0 1.0 * [0.1, 0.4]
1 m

[App(fa ,,af f (x)


m ,[d])]
[ 1 ]
= + f= [0.5,0.7]
+ [0.8,1.0] + [0.0,0.3] + [0.2,0.5] + [0.4,0.7]
[App(f a1 ,,af f (x), [App(f
m ,[d])] a1 ,,af f (x)
m ,[d])]
[d] x1 x2 x3 x4 x5 +
[ ] [0.3,0.6] [0.0,0.2] [0.6,0.9] [0.9,1.0] [0.1,0.4]
= supzD (x) { f (z)}, supzDA
(x) {[d]
+
f (z)} .(10) x6 + x7 + x8 + x9 + x10 .
[ A [d]
f f ]
By Definition 3, we obtain the following lower and
The pair [App(ae1 , , af m , [d])], [App(ae1 , , a
fm , [d])] upper approximations of [d]: f
is referred to as the rough approximation of the interval- f
[App(ae1 , ae2 , ae3 , ae4 , [d])] = [0.5,0.7] + [0.3,0.6] +
valued fuzzy set [d] f by using the knowledge about x1 x2
[0.0,0.3] [0.1,0.4] [0.4,0.7] [0.1,0.4] [0.0,0.2] [0.6,0.9]
+ x4 + x5 + x6 + x7 + x8 +
{ae1 , , af
m }, i.e. rough interval-valued fuzzy set in x3
[0.9,1.0]
terms of the dominance principle in the incomplete x9 + [0.1,0.4]
x10 ,
environment. f
[App(ae1 , ae2 , ae3 , ae4 , [d])] = [0.8,1.0] + [0.8,1.0] +
x1 x2
Remark 1: [0.0,0.3]
+ [0.3,0.6]
+ [0.4,0.7]
+ [0.8,1.0]
+ [0.0,0.2]
+ [0.6,0.9]
+
x3 x4 x5 x6 x7 x8
If for each x U , f (x) = [0.9,1.0] [0.3,0.6]
+ x10 .
a1 ,,af
[App(f m ,[d])] x9
[App(f a1 ,,af f (x), then the interval-valued fuzzy By the above results, we can derive the following initial
m ,[d])]

set [d] f is definable in the IFISID. Otherwise, it is decision rules from Table 1:
undefinable. at least decision rules:
f is an ordinary fuzzy set on universe U , then
If [d]
r1 : af1 (y) 0.9 af2 (y) af3 (y) 0.2
[App(ae1 , , af f f f (y) 0.7 [d] f (y) [0.5, 0.7] // supported by x1
m , [d])] and [App(ae1 , , af a
m , [d])] 4

would degenerate to be the ordinary lower and upper r 2 : af1


(y) 0.9 af2 (y) 0.2 af3 (y) 0.2
approximate fuzzy sets in terms of the dominance af4
(y) 0.1 f
[d]
(y) [0.3, 0.6] // supported by x2
principle in the incomplete environment. r 3 : af1
(y) 0.1 af2
(y) 0.1 af3 (y) 0.1
By the lower and upper approximations of the interval- af4
(y) 0.9 f
[d]
(y) [0.0, 0.3] // supported by x3
f
valued fuzzy set [d], one can induce the corresponding r4 : af1 (y) 0.0 af2 (y) 0.9 af3 (y)
decision rules for each training example x U such that af4 (y) 0.8 [d] f (y) [0.1, 0.4] // supported by x4

at least decision rules: r5 : af1 (y) 0.1 af2 (y) 0.1 af3 (y) 1.0
af4 (y) 0.8 [d] f (y) [0.4, 0.7] // supported by x5
af1 (y) af1 (x) af2 (y) af2 (x) r6 : af1 (y) af2 (y) 0.2 af3 (y) 0.9
af (y) af (x) [d]f (y) [App(fa1 ,,af f (x); a f (y) 0.1 [d] f (y) [0.1, 0.4] // supported by x6
m m m ,[d])] 4

r7 : af1 (y) 0.0 af2 (y) 0.1 af3 (y) 0.9


at most decision rules:
af4 (y) 0.2 [d] f (y) [0.0, 0.2] // supported by x7
af1 (y) af1 (x) af2 (y) af2 (x) r8 : af1 (y) 0.9 af2 (y) 0.9 af3 (y) 0.1
(y) af (x) [d] (y) 1.0 [d] f (y) [0.6, 0.9] // supported by x8
f (y) [App(f
af af
m m a1 ,,af f (x).
m ,[d])]
4

r9 : af1 (y) 0.8 af2 (y) 0.4 af3 (y) 1.0


In this paper, the above two types of decision rules are af4 (y) 1.0 [d] f (y) [0.9, 1.0] // supported by x9
referred to as the initial at least and at most decision r10 : af1 (y) 0.0 af2 (y) 1.0 af3 (y) 1.0
rules derived from the IFISID. af4 (y) [d] f (y) [0.1, 0.4] // supported by x10
Example 1: To demonstrate the IFISID, let us consider at most decision rules:

data in Table 1, which describes a small training set r1 : af1 (y) 0.9 af2 (y) af3 (y) 0.2
with fuzzy objects. The universe of discourse is U = af4 (y) 0.7 [d] f (y) [0.5, 0.7] // supported by x1
{x1 , x2 , , x10 }. AT = {a1 , a2 , a3 , a4 } is the set of
r2 : af1 (y) 0.9 af2 (y) 0.2 af3 (y) 0.2
condition attributes and d is the decision attribute, which
af4 (y) 0.1 [d] f (y) [0.3, 0.6] // supported by x2
are used to describe such ten objects. In Table 1, the set
to be approximated is an interval-valued fuzzy set such r3 : af1 (y) 0.1 af2 (y) 0.1 af3 (y) 0.1
that af4
(y) 0.9 [d] f (y) [0.0, 0.3] // supported by x3

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1379


r4 : af1 (y) 0.0 af2 (y) 0.9 af3 (y) where aei C (i {1, , m}) is the complementary
af4 (y) 0.8 [d]
f (y) [0.1, 0.4] // supported by x4 of the fuzzy set aei such that for each x U
{
r5 : af1 (y) 0.1 af2 (y) 0.1 af3 (y) 1.0 1 aei (x) : aei (x) =
af4 (y) 0.8 [d] aei C (x) =
f (y) [0.4, 0.7] // supported by x5 : otherwise

r6 : af1 (y) af2 (y) 0.2 af3 (y) 0.9 Proof:
af4 (y) 0.1 [d]
f (y) [0.1, 0.4] // supported by x6 1) Suppose that {a1 , , am } = A. Since D(A) is


r7 : af1 (y) 0.0 af2 (y) 0.1 af3 (y) 0.9 reflexive, we have x DA (x). Thus
af4 (y) 0.2 [d]
f (y) [0.0, 0.2] // supported by x7 infzD (x) {
A f (z)} f (x),
[d] [d]
r8 : af1 (y) 0.9 af2 (y) 0.9 af3 (y) 0.1
infzD (x) {+
f (z)} f (x),
+
af4 (y) 1.0 [d]
f (y) [0.6, 0.9] // supported by x8 A [d] [d]

r9 : af1 (y) 0.8 af2 (y) 0.4 af3 (y) 1.0 hold, from which we can conclude that
af4 (y) 1.0 [d]
f (y) [0.9, 1.0] // supported by x9
[App(f f (x) [d]
a ,,af ,[d])] f (x). (12)
r10 : af1 (y) 0.0 af2 (y) 1.0 af3 (y) 1.0 1 m

af4 (y) [d]f (y) [0.1, 0.4] // supported by x10 Similarity, it is not difficult to prove that

f (x) [App(f
[d] f (x).
a ,,af ,[d])]
1 m
(13)
B. Properties of Rough Interval-valued Fuzzy Set
2) Suppose that {a1 , , am } = A. By [d g g
1 ] [d2 ],
Theorem 1: Let D be an IFISID, then we have the
for each z DA (x), we have [d g (z) g (z),
following properties: 1] [d 2]
from which we obtain that
1) Contraction and extension:
infzD (x) {
g (z)} infzD (x) { g (z)},
[App(ae1 , , af f f
m , [d])] [d] [App(ae1 , , a
f f
m , [d])];
A [d1 ] A [d2 ]

(11) infzD (x) {+


g (z)} infzD (x) { g (z)},
+
A [d1 ] A [d2 ]
2) Monotone (with the monotone of the interval-valued
i.e.
fuzzy set)
[App(f g (x) [App(f g (x)
g
[d g
1 ] [d2 ]
a1 ,,af
m ,[d1 ])] a1 ,,af
m ,[d2 ])]
(14)
[App(ae1 , , af g
m , [d1 ])] [App(ae1 , , a
f g
m , [d2 ])], holds. Similarity, it is not difficult to prove that
[App(ae1 , , af g
m , [d1 ])] [App(ae1 , , a
f g
m , [d2 ])]; [App(f a1 ,,af g (x) [App(f
m ,[d1 ])] a1 ,,af g (x).
m ,[d2 ])]
(15)
3) Monotone (with the monotone of the condition
3) Suppose that A1 = {a1 , , am } A2 =
attributes)
{a1 , , an }. By Definition 2 we have DA 1
(x)

{a1 , , am } {a1 , , an } DA2 (x) and DA1 (x) DA2 (x) for each x U .
f f Thus
[App(ae1 , , af
m , [d])] [App(ae1 , , a
f n , [d])],

[App(ae1 , , af f f infzD (x) {


f (z)} infzD (x) { f (z)},
m , [d])] [App(ae1 , , a
f n , [d])]; A1 [d] A2 [d]
infzD (x) {+
f (z)} infzD (x) { f (z)},
+
4) Multiplication and addition A1 [d] A2 [d]

g g hold, from which we can conclude that


[App(ae1 , , af
m , [d1 ] [d2 ])] =
g g [App(f f (x) [App(f f (x).
[App(ae1 , , af
m , [d1 ])] [App(ae1 , , a
f m , [d2 ])],
a1 ,,af
m ,[d])] a1 ,,f
an ,[d])]
(16)
[App(ae1 , , af g g
m , [d1 ] [d2 ])] Similarity, it is not difficult to prove that
[App(ae1 , , af g
m , [d1 ])] [App(ae1 , , a
f g
m , [d2 ])], [App(f
a1 ,,af f (x) [App(f
m ,[d])] a1 ,,f f (x).
an ,[d])]
[App(ae1 , , af g g
m , [d1 ] [d2 ])] = (17)
g g 4) Suppose that {a1 , , am } = A. For each x U ,
[App(ae1 , , af
m , [d1 ])] [App(ae1 , , a
f m , [d2 ])],
by the properties of the interval-valued fuzzy set,
[App(ae1 , , af g g
m , [d1 ] [d2 ])] we have
[App(ae1 , , af g
m , [d1 ])] [App(ae1 , , a
f g
m , [d2 ])];
{
infzD (x) { g2 (z)} = min infzDA
g1 [d]

(x) {[d]
g1 (z)},
[d]
}
A
5) Complement
infzD (x) { g (z)} ,
[d]2
{
A
[ f ] [
C
f ]
C f C, infzD (x) {+ (x) {[d]
C +
App(ae1 , , afm , [d] ) = App(ae1 , , am , [d]) g2 (z)} = min infzDA
g1 [d]
[d]

g1 (z)},
}
A
[ f ] [
C
f ] infzD (x) {+
App(ae1 , , af C C f C, g (z)} ,
m , [d] ) = App(ae1 , , am , [d]) A [d]2

2012 ACADEMY PUBLISHER


1380 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

from which we can conclude that If [App(ae1 , , af g


m , [d1 ])] =
[App(ae1 , , af g
m , [d2 ])], then the interval-valued
[App(f
a ,,af ,[d g])] (x) = min{
g][d g g
1 m 1 2
fuzzy sets [d 1 ] and [d2 ] are referred to as upper
[App(f g])] (x), [App(f
a ,,af ,[d
1 m 1
g])] (x)}.
a ,,af ,[d 1 m 2 approximate equal, which is denote by [d g g
1 ] =U [d2 ];
g g g g g
If [d1 ] =L [d2 ] and [d1 ] =U [d2 ], then [d1 ] and [d g 2]
Other formulas can be proved analogously.
5) For each x U , are referred to as rough equal, which is denote by
g
[d g
1 ] =R [d2 ].
[App(f
a1 ,,af f C (x) Theorem 2: Let D be an incomplete fuzzy information
m ,[d] )]
[ ] system with interval decision, we have
= infzD (x) { f C (z)}, infzD (x) {
+
f C (z)}
A [d] A [d]
[ g g g g g g
= infzD (x) {1 + (z)}, [d 1 ] =L [d2 ] ([d1 ] [d2 ]) =L [d1 ], [d2 ];
f
A

[d]
] g
[d g g g g g
1 ] =U [d2 ] ([d1 ] [d2 ]) =U [d1 ], [d2 ];
infzD (x) {1 f (z)} .
A [d] Proof: By 4) of Theorem 1 and Definition 4, we
Suppose that {a1 , , am } = A, thus, have

DA
(x) = D{f (x) g
[d g
1 ] =L [d2 ]
a1 ,,af
m}

= {y U : aei (y) aei (x) [App(ae1 , , af g


m , [d1 ])] = [App(ae1 , , af g
m , [d2 ])]

aei (x) = aei (y) = } [App(ae1 , , af g


m , [d1 ])] [App(ae1 , , a
f g
m , [d2 ])] =
= {y U : 1 aei (y) 1 aei (x) [App(ae1 , , af g g
m , [d1 ] [d2 ])] =
aei (x) = aei (y) = } g g
[App(ae1 , , af
m , [d1 ])] = [App(ae1 , , a
fm , [d2 ])]
= D{f (x)
a C ,,af C }
1 m g
([d g g g
1 ] [d2 ]) =L [d1 ], [d2 ].
where i {1, , m}, then
infzD (x) {1 +
f (z)} = Similar to the above progress, it is not difficult to prove
A [d]
g g g g g g
1 ] =U [d2 ] ([d1 ] [d2 ]) =U [d1 ], [d2 ].
1 supzD (x) {+f (z)} = that [d
A [d]
1 supzD (x) {[d]
+
f (z)},
{a
f C g
1 ,,a mC } IV. K NOWLEDGE REDUCTIONS OF ROUGH
infzD (x) {1
f (z)} =
INTERVAL - VALUED FUZZY SET
A [d]
1 supzD (x) {f (z)} = One fundamental aspect of rough set theory involves
A [d] the search for particular subsets of attributes, which

1 supzD (x) {[d]
f (z)}, provide the same information for classification or some
{a
f C g mC }
1 ,,a
other purposes as the full set of the condition attributes.
hold, from which we can conclude that Such subsets are called reducts. In recent years, many
[infzD (x) { fC
(z)}, infzD (x) {+ fC
(z)}] types of knowledge reductions [22], [25], [41], [42], [48],
A [d] A [d]
[ [49] have been proposed based on different types of
= 1 supzD {+ f (z)}, rough set models. In the following, based on the rough
{af C ,,ag C } (x) [d]
1 m
] approximation of the interval-valued fuzzy set proposed
1 supzD { f (z)}
{a f C gmC }
(x) [d] in the above section, we will propose the following four
[ 1 ,,a

= supzD { (z)}, types of knowledge reductions.
(x) f
{af C
1 ,,agmC }
[d]
Definition 5: Let D be an IFISID, A =
]C
supzD (x) { f (z)}
+
, {ae1 , , af
m } AT = {ae1 , , a
f n },
[d]
f f
{a
f C g mC }
1 ,,a
1) If [App(ae1 , , af
m , [d])] = [App(ae1 , , a
f n , [d])],
i.e. then A is referred to as a lower approximate consis-
[App(f f C )] (x) = 1[App(af f f (x).
tent attributes set of D; if A is a lower approximate
a ,,af ,[d] C ,,aC ,[d])]
1 m 1 m
consistent attributes set of D and no proper subset
Similarity, it is not difficult to prove that of A is the lower approximate consistent attributes
[ set of D, then A is referred to as the lower approx-
f ] [ ]
C
App(ae1 , , af C fC f C.
m , [d] ) = App(ae1 , , am , [d]) imate reduct of D;
2) If [App(ae1 , , af f
m , [d])] = [App(ae1 , , a
f f
n , [d])],
Definition 4: Let D be an IFISID, then A is referred to as a upper approximate consis-
g tent attributes set of D; if A is a upper approximate
If [App(ae1 , , af
m , [d1 ])] =
g consistent attributes set of D and no proper subset
[App(ae1 , , af
m , [d2 ])], then the interval-valued of A is the upper approximate consistent attributes
fuzzy sets [d g g
1 and [d2 ] are referred to as lower
] set of D, then A is referred to as the upper approx-
approximate equal, which is denote by [d g g
1 ] =L [d2 ]; imate reduct of D;

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1381

3) x U , if [App(f a1 ,,af f (x) = Similarity, it is not difficult to obtain the simplified and
m ,[d])]
[App(f f (x), then A is referred to as optimal at most decision rules which are supported by
a1 ,,f
an ,[d])]
a relative lower approximate consistent attributes x in D.
set for x in D; if A is a relative lower approximate Reducts computation can also be translated into the
consistent attributes set for x in D and no proper computation of prime implicants of a Boolean function.
subset of A is the relative lower approximate It has been shown by Skowron and Rauszer [35] that the
consistent attributes set for x in D, then A is problem of finding reducts may be solved as a case in
referred to as the relative lower approximate reduct Boolean reasoning. We will generalize this approach to
for x in D; compute the above four types of reducts in the IFISID.
4) x U , if [App(f Definition 6: Let D be an IFISID, AT =
a1 ,,af f (x) =
m ,[d])]
{a1 , a2 , , an } is the set of condition attributes,
[App(f f (x), then A is referred to as
a1 ,,f
an ,[d])] x, y U , denote
a relative upper approximate consistent attributes
set for x in D; if A is a relative upper approximate L
DAT = {(x, y) U 2 : [App(f f (x) > [d]
f (y)},
a ,,f
a ,[d])]
consistent attributes set for x in D and no proper
1 n

subset of A is the relative upper approximate


U
DAT = {(x, y) U 2 : [App(f
a ,,f f (x) < [d]
a ,[d])] f (y)},
1 n

consistent attributes set for x in D, then A is


referred to as the relative upper approximate reduct define
for x in D. {
By the above definition, we can see that the lower L {ai AT : (y, x)
/ D(ai )}: (x, y) DAT
L
DAT (x, y) =
(upper) approximate consistent attributes sets of D are AT : otherwise
subsets of the condition attributes, which preserve the
{
lower (upper) approximations of the interval-valued fuzzy U {ai AT : (x, y)
/ D(ai )}: (x, y) DAT
U
f the lower (upper) approximate reducts of D
set [d];
DAT (x, y) =
AT : otherwise
are minimal subsets of the condition attributes, which
preserve the lower (upper) approximations of the interval- where 1 i n, DAT L
(x, y) and DAT H
(x, y) are referred
valued fuzzy set [d]. f The sets of the lower (upper) to as the lower and upper approximate discernibility
approximate reducts of D are denoted by RedL (RedU ). sets for pair of the objects (x, y) respectively, DL AT =
The relative lower (upper) approximate consistent at- {DAT (x, y) : (x, y) DAT } and DAT = {DAT (x, y) :
L L U U

tributes sets for x in D are subsets of the condition (x, y) DAT U


} are referred to as the lower and upper
attributes, which preserve the lower (upper) approximate approximate discernibility matrixes of D respectively.
membership values of the interval-valued fuzzy set [d] f for Theorem 3: Let D be an IFISID, A =
x; the relative lower (upper) approximate reducts for x in {a1 , a2 , , am } AT = {a1 , a2 , , an }, then
D are minimal subsets of the condition attributes, which we have
preserve the membership values of the lower (upper) 1) A is the lower approximate consistent attributes sets
approximate interval-valued fuzzy set [d] f for x. The sets of D A DAT L
(x, y) = , (x, y) DAT L
;
of the relative lower (upper) approximate reducts for x in 2) A is the upper approximate consistent attributes sets
D are denoted by RedL (x) (RedU (x)). of D A DAT U
(x, y) = , (x, y) DAT U
;
Suppose that D is an IFISID, A = {a1 , , am } 3) x U , A is the relative lower approximate consis-
AT = {a1 , , an }, x U , tent attributes sets for x in D ADAT L
(x, y) = ,
rx : af1 (y) af1 (x) af2 (y) af2 (x) y U (x, y) DAT ; L
afn (y) afn (x) [d] f (y) [App(f a1 ,,f f (x)
an ,[d])] 4) x U , A is the relative upper approximate consis-
is the initial at least decision rule supported by x, tent attributes sets for x in D ADAT U
(x, y) = ,
then it is not difficult to observe that: y U (x, y) DAT . U

If A is a relative lower approximate consistent at- Proof:


tributes set for x in D, then the rule 1) : Suppose that (x, y) DAT L
such that A

rx : af1 (y) af1 (x) af2 (y) af2 (x) DAT (x, y) = , then by Definition 6 we have
L
af m
(y) af m
(x) [d] f (y) (y, x) D(A), y DA
(x). By Definition 3,
[App(f a1 ,,af f
m ,[d])]
(x) we obtain that [App(f a1 ,,af f (x) [d]
m ,[d])]
f (y).
is a simplified at least decision rule supported by Since A is the lower approximate consistent at-
x in D; tributes set of D, i.e. [App(f a1 ,,af f (x) =
m ,[d])]
If A is a relative lower approximate reduct for x in [App(f f (x) for each x U , we obtain
D, then the rule a1 ,,f
an ,[d])]
[App(f f (x) [d] f (y), which contradict-
rx : af1 (y) af1 (x) af2 (y) af2 (x) a1 ,,f
an ,[d])]
af (y) af (x) [d] s that [App(f f (x) > [d] f (y) because
m m f (y) a ,,f
1 na ,[d])]
[App(f f (x) (x, y) DAT
L
.
a1 ,,af
m ,[d])]
is an optimal at least decision rule supported by x : Since A AT , by 3) of Theo-
in D. rem 1, we obtain that [App(f f (x)
a ,,af ,[d])] 1 m

2012 ACADEMY PUBLISHER


1382 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012


f (x) for each x U . Suppose
U
[App(fa1 ,,f
an ,[d])]
= DAT (x, y); (21)
that A is not the lower approximate consistent U
(x,y)DAT
attributes set of D, then there must be x U such
L and H are referred to as the lower and upper
that [App(f f (x) < [App(f f (x),
a1 ,,af
m ,[d])] a1 ,,f
an ,[d])] approximate discernibility functions of D respectively,
from which we can conclude that there must be L (x) and H (x) are referred to as the relative lower
y U where [App(f a ,,f f (x) > [d]
a ,[d])] f (y) and upper approximate discernibility functions for x in D
1 n

such that (y, x) D(A), i.e. (x, y) DAT L


and respectively.
A DAT (x, y) = . From discussion above, we
L
By using Boolean reasoning techniques, we can obtain
can draw the following conclusion: x, y U , if the following Theorem 4 from Theorem 3 immediately.
(x, y) DAT L
and A DAT L
(x, y) = , then A is Theorem 4: Let D be an IFISID, A AT , then
the lower approximate consistent attributes set of 1) A is the lower (upper) approximate reduct of D
D. if and only if A is a prime implicant of the
2) : Suppose that (x, y) DAT U
such that A lower (upper) approximate discernibility function
DAT (x, y) = , then by Definition 6 we have
U
L (U );

(x, y) D(A), y DA (x). By Definition 3, 2) x U , A is the relative lower (upper)
we obtain that [d] (y) [App(f approximate
f a1 ,,af f (x).
m ,[d])] reduct for x in D if and only if A is a prime
Since A is the upper approximate consistent at- implicant of the relative lower (upper) approximate
tributes set of D, i.e. [App(f a1 ,,af f (x) = discernibility function L (x) (U (x)).
m ,[d])]
[App(fa1 ,,f f (x) for each x U , we obtain
an ,[d])] Example 2: Following Example 1, computing all of the
[d]
f (y) [App(f a1 ,,f f (x), which contradict-
an ,[d])] optimal decision rules in Table 1.
s that [App(f a1 ,,f f (x) < [d]
an ,[d])] f (y) because By Definition
{ 6, we have
L
(x, y) DAT .U DAT = (x1 , x3 ), (x1 , x4 ), (x1 , x5 ), (x1 , x6 ), (x1 , x7 ),
: Since A AT , by 3) of Theo- (x1 , x10 ), (x2 , x3 ), (x2 , x4 ), (x2 , x7 ), (x2 , x10 ), (x3 , x7 ),
rem 1, we obtain that [App(f a1 ,,af f (x)
m ,[d])]
(x4 , x3 ), (x4 , x7 ), (x5 , x3 ), (x5 , x4 ), (x5 , x6 ), (x5 , x7 ),
[App(fa1 ,,f f
an ,[d])]
(x) for each x U . Suppose (x5 , x10 ), (x6 , x3 ), (x6 , x7 ), (x8 , x1 ), (x8 , x3 ), (x8 , x4 ),
that A is not the upper approximate consistent (x8 , x5 ), (x8 , x6 ), (x8 , x7 ), (x8 , x10 ), (x9 , x1 ), (x9 , x2 ),
attributes set of D, then there must be x U such (x9 , x3 ), (x9 , x4 ), (x9 , x5 ), (x}9 , x6 ), (x9 , x7 ), (x9 , x8 ),
that [App(f f (x) > [App(f f (x),
(x9 , x10 ), (x10 , x3 ), (x10 , x7 ) .
a1 ,,af
m ,[d])] a1 ,,f
an ,[d])]
from which we can conclude that there must be By Definition 7, we obtain that L = a1 a2 a3 a4 .
y U where [App(f By Theorem 4, the set of attributes {a1 , a2 , a3 , a4 } is the
a1 ,,f f (x) < [d]
an ,[d])] f (y)
lower approximate reduct of Table 1, i.e. no condition
such that (x, y) D(A), i.e. (x, y) DAT and U
attribute is redundant in Table 1 for preserving the lower
A DAT U
(x, y) = . From discussion above, we f
can draw the following conclusion: x, y U , if approximation of [d].
(x, y) DAT U
and A DAT U
(x, y) = , then A is By Definition 7, we can also obtain the following
the upper approximate consistent attributes set of results:
D. RedL (x1 ) = {{a1 , a4 }};
3) The proofs of 3) and 4) are similar to the proofs of RedL (x2 ) = {a1 };
1) and 2) respectively. RedL (x3 ) = {a1 , a4 };
RedL (x4 ) = {a2 };
Definition 7: Let D be an IFISID, define RedL (x5 ) = {{a1 , a3 }};
RedL (x6 ) = {a2 };
L
L = DAT (x, y) RedL (x7 ) = AT ;
(x,y)U 2
RedL (x8 ) = {{a1 , a4 }};
L
= DAT (x, y); (18) RedL (x9 ) = {{a1 , a3 }};
L
(x,y)DAT RedL (x10 ) = {a2 , a3 }.

U = U
DAT (x, y) By these relative lower approximate reducts, we can
(x,y)U 2
generate all of the optimal at least decision rules from
Table 1:
U
= DAT (x, y); (19) R1 : af1 (y) 0.9 af4 (y) 0.7 [d] f (y)
U
(x,y)DAT L
[0.5, 0.7] // supported by Red (x1 )
L (x) = L
DAT (x, y) R2 : af1 (y) 0.9 [d] f (y) [0.3, 0.6] // supported
yU
by RedL (x2 )
= L
DAT (x, y); (20) R3 : af1 (y) 0.1 af4 (y) 0.9 [d] f (y)
L
L
(x,y)DAT [0.0, 0.3] // supported by Red (x3 )

U (x) = U
DAT (x, y) R4 : af2 (y) 0.9 [d] f (y) [0.1, 0.4] // supported
L
yU by Red (x4 )

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1383

R5 : af1 (y) 0.1 af3 (y) 1.0 [d] f (y) framework for the study of the dominancebased fuzzy
[0.4, 0.7] // supported by RedL (x5 ) rough set in the incomplete fuzzy information system with
R6 : af2 (y) 0.2 [d]f (y) [0.1, 0.4] // supported
interval-valued decision. In our approach, the rough ap-
L
by Red (x6 ) proximation of the interval-valued fuzzy set is constructed
R7 : af1 (y) 0.0 af2 (y) 0.1 af3 (y) 0.9 on the basis of an expanded dominance relation. Such
af4 (y) 0.2 [d] rough approximation is a generalization of the dominance-
f (y) [0.0, 0.2] // supported by
based fuzzy rough set in the fuzzy environment. Based on
RedL (x7 )
the proposed rough approximation of the interval-valued
R8 : af1 (y) 0.9 af4 (y) 1.0 [d] f (y)
fuzzy set, we also propose four types of the knowledge
L
[0.6, 0.9] // supported by Red (x8 ) reductions, lower and upper approximate reducts, relative
R9 : af1 (y) 0.8 af3 (y) 1.0 [d] f (y) lower and upper approximate reducts for an object. By
[0.9, 1.0] // supported by RedL (x9 ) the relative lower and upper approximate reducts for an
R10 : af2 (y) 1.0 af3 (y) 1.0 [d] f (y) object, one can induce optimal at least and at most
L decision rules which are supported by such object in the
[0.1, 0.4] // supported by Red (x10 )
Similarity, we obtain that U = a1 a2 a3 . By information system.
Theorem 4, the set of attributes {a1 , a2 , a3 } is the upper For further research, the proposed approach can be
approximate reduct of Table 1. Moreover, extended to more general and complex information sys-
RedU (x1 ) = {a3 , a4 }; tems such as the information system with interval-valued
RedU (x2 ) = {a2 , a3 , a4 }; domains of the condition attributes. On the other hand, the
RedU (x3 ) = {{a2 , a3 }}; rough approximation of the interval-valued fuzzy set in
RedU (x4 ) = {a1 }; the incomplete environment with some other explanations
RedU (x5 ) = {{a1 , a2 }}; of the unknown values (e.g. the unknown value is a non-
RedU (x6 ) = {a2 , a3 , a4 }; existing one) are exciting areas to be explored. We will
RedU (x7 ) = {{a1 , a2 }, {a2 , a4 }}; study these issues in our future works.
RedU (x8 ) = {a3 };
RedU (x9 ) = AT ;
RedU (x10 ) = {a1 }. R EFERENCES
Thus, we can generate the following optimal at most
decision rules from Table 1: [1] J. Baszczynski, S. Greco and R. Sowinski, On variable

R1 : af3 (y) 0.2 af4 (y) 0.7 [d] consistency dominance-based rough set approaches, Proc.
f (y)
5th Intl Conf. on Rough Sets and Current Trends in
[0.5, 0.7] // supported by RedU (x1 ) Computing (RSCTC 2006), pp. 191202, 2006.

R2 : af2 (y) 0.2 af3 (y) 0.2 af4 (y) 0.1 [2] J. Baszczynski, S. Greco and R. Sowinski, Monotonic
f (y) [0.3, 0.6] // supported by Red (x2 )
U variable consistency rough set approaches, Proc. 2nd.
[d]
Intl Conf. Rough Sets and Knowledge Technology (RSKT
R3 : af2 (y) 0.1 af3 (y) 0.1 [d] f (y) 2007), pp. 126133, 2007.
[0.0, 0.3] // supported by RedU (x3 ) [3] J. Baszczynski, S. Greco and R. Sowinski, Multi-criteria

R4 : af1 (y) 0.0 [d]
f (y) [0.1, 0.4] // supported
classification-A new scheme for application of dominance-
U based decision rules, European Journal of Operational
by Red (x4 ) Research, vol. 181, pp. 10301044, 2007.

R5 : af1 (y) 0.1 af2 (y) 0.1 [d] f (y) [4] M.D. Cock, C. Cornelis and E.E. Kerre, Fuzzy rough sets:
[0.4, 0.7] // supported by RedU (x5 ) the forgotten step, IEEE Transactions on Fuzzy Systems,

R6 : af2 (y) 0.2 af3 (y) 0.9 af4 (y) 0.1 vol. 15, no. 1, pp. 121130, 2007.
[5] D. Dubois and H. Prade, Rough fuzzy sets and fuzzy
f (y) [0.1, 0.4] // supported by Red (x6 )
U
[d] rough sets, International Journal of General Systems, vol.

R7 : af2 (y) 0.1 (af1 (y) 0.0 af4 (y) 0.2) 17, pp. 191208, 1990.
f (y) [0.0, 0.2] // supported by Red (x7 )
U [6] D. Dubois, H. Prade, Putting rough sets and fuzzy sets
[d]
together, Intelligent Decision Support, Handbook of Ap-
R8 : af3 (y) 0.1 [d]
f (y) [0.6, 0.9] // supported plications and Advances of the Rough Sets Theory, pp.
U 203232, 1992.
by Red (x8 )

R9 : af1 (y) 0.8 af2 (y) 0.4 af3 (y) 1.0 [7] T.F. Fan, D.R. Liu and G.H. Tzeng, Rough set-based log-
ics for multicriteria decision analysis, European Journal
af4 (y) 1.0 [d] f (y) [0.9, 1.0] // supported by of Operational Research, vol. 182, pp. 340355, 2007.
U
Red (x9 ) [8] P. Fortemps, S Greco and R. Sowinski, Multicriteria

R10 : af1 (y) 0.0 [d]
f (y) [0.1, 0.4] // supported decision support using rules that represent rough-graded
U preference relations, European Journal of Operational
by Red (x10 )
Research, vol. 188, pp. 206223, 2008.
[9] Z.T. Gong, B.Z. Sun and D.G. Chen, Rough set theory
V. C ONCLUSION for the interval-valued fuzzy information systems, Infor-
mation Sciences, vol. 178, pp. 19681985, 2008.
In recent years, how to expand the traditional rough set
[10] S. Greco, M. Inuiguchi and R. Sowinski, Fuzzy rough
model in different types of complex information systems sets and multiple-premise gradual decision rules, Inter-
playing an important role in the development of the rough national Journal of Approximate Reasoning, vol. 41, pp.
set theory. In this paper, we have developed a general 179211, 2006.

2012 ACADEMY PUBLISHER


1384 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

[11] S. Greco, B. Matarazzo and R. Sowinski, Handing miss- [32] Z. Pawlak and A. Skowron, Rough sets: Some extension-
ing values in rough set analysis of mutiattribute and muti- s, Information Sciences, vol. 177, pp. 2840, 2007.
criteria decision problems, Proc. 7th Intl Workshop on [33] Z. Pawlak and A. Skowron, Rough sets and boolean
Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft reasoning, Information Sciences, vol. 177, pp. 4173,
Computing (RSFDGrC99), pp. 146157, 1999. 2007.
[12] S. Greco, B. Matarazzo and R. Sowinski, Rough approx- [34] M.W. Shao and W.X. Zhang, Dominance relation and
imation by dominance relations, International Journal of rules in an incomplete ordered information system, Inter-
Intelligent Systems, vol. 17, pp. 153171, 2002. national Journal of Intelligent Systems, vol. 20, pp. 1327,
[13] S. Greco, B. Matarazzo and R. Sowinski, Rough sets the- 2005.
ory for multicriteria decision analysis, European Journal [35] A. Skowron and C. Rauszer, The discernibility matrices
of Operational Research, vol. 129, pp. 147, 2002. and functions in information systems, Intelligent Decision
[14] S. Greco, B. Matarazzo and R. Sowinski, Dominance- Support, Handbook of Applications and Advances of the
Based Rough Set Approach to Case-Based Reasoning, Rough Sets Theory, pp. 331362, 1992.
Proc. 3rd Intl Conf. on Modeling Decisions for Artificial [36] R. Sowinski and D. Vanderpooten, A generalized defini-
Intelligence (MDAI 2006), pp. 718, 2006. tion of rough approximations based on similarity, IEEE
[15] Q.H. Hu, D.R. Yu, Z.X. Xie and J.F. Liu, Fuzzy Proba- Transactions on Knowledge and Data Enginerring, vol.
bilistic Approximation Spaces and Their Information Mea- 12, no. 2, pp. 331336, 2000.
sures, IEEE Transactions on Fuzzy Systems, vol. 14, no. [37] J. Stefanowski and A. Tsoukias, On the extension of
2, pp. 191201, 2006. rough sets under incomplete information, Proc. 7th In-
[16] J.W. Grzymala-Busse, On the unknown attribute values tl Workshop on Rough Sets, Fuzzy Sets, Data Mining,
in learning from examples, Proc. 6th Intl Symposium on and Granular-Soft Computing (RSFDGrC99), pp. 7382,
Methodologies for Intelligent Systems, pp. 368377, 1991. 1999.
[17] J.W. Grzymala-Busse and A. Y. Wang, Modified algo- [38] J. Stefanowski and A. Tsoukias, Incomplete information
rithms LEM1 and LEM2 for rule induction from data with tables and rough classification, Computational Intelli-
missing attribute values, 5th Intl Workshop on Rough Sets gence, vol. 17, pp. 545566, 2001.
and Soft Computing at the 3rd Joint Conf. on Information [39] S.K. Pal and P. Mitra, Case Generation Using Rough
Sciences, pp. 6972, 1997. Sets with Fuzzy Representation, IEEE Transactions on
[18] J.W. Grzymala-Busse, Characteristic relations for incom- Knowledge and Data Enginerring, vol. 16, no. 3, pp. 292
plete data: a generalization of the indiscernibility relation, 300, 2004.
Proc. 3rd Intl Conf. on Rough Sets and Current Trends in [40] G.Y. Wang, Extension of rough set under incomplete
Computing, pp. 244253, 2004. information systems, Proc. 11th IEEE International Con-
[19] J.W. Grzymala-Busse, Data with missing attribute values: ference on Fuzzy Systems, pp. 10981103, 2002.
generalization of indiscernibility relation and rule rnduc- [41] G.Y. Wang, Rough reduction in algebra view and infor-
tion, Transactions on Rough Sets I, Lecture Notes in mation view, International Journal of Intelligent Systems,
Computer Science, vol. 3100, pp. 7895, 2004. vol. 18, pp. 679688, 2003.
[42] X.Z. Wang, E.C.C. Tsang, S.Y. Zhao, D.G. Chen and D.S.
[20] M. Kryszkiewicz, Rough set approach to incomplete
Yeung, Learning fuzzy rules from fuzzy samples based
information systems, Information Sciences, vol. 112, pp.
on rough set technique, Information Sciences, vol. 177,
3949, 1998.
pp. 44934514, 2007.
[21] M. Kryszkiewicz, Rules in incomplete information sys-
[43] W.Z. Wu, W.X. Zhang and H.Z. Li, Knowledge acquisi-
tems, Information Sciences, vol. 113, pp. 271292, 1999.
tion in incomplete fuzzy information systems via the rough
[22] M. Kryszkiewicz, Comparative study of alternative types set approach, Expert Systems, vol. 20, pp. 280286, 2003.
of knowledge reduction in inconsistent systems, Interna- [44] X.B. Yang, J. Xie, X.N. Song and J.Y. Yang, Credible
tional Journal of Intelligent Systems, vol. 16, pp. 105120, rules in incomplete decision system based on descriptors,
2001. Knowlege Based Systems, vol. 22, pp. 817, 2010.
[23] Y. Leung and D.Y. Li, Maximal consistent block tech- [45] X.B. Yang, J.Y. Yang, C. Wu and D.J. Yu, Dominance-
nique for rule acquisition in incomplete information sys- based rough set approach and knowledge reductions in
tems, Information Sciences, vol. 115, pp. 85106, 2003. incomplete ordered information system, Information Sci-
[24] Y. Leung, W.Z. Wu and W.X. Zhang, Knowledge ac- ences, vol. 178, pp. 12191234, 2008.
quisition in incomplete information systems: a rough set [46] D.S. Yeung, D.G. Chen, E.C.C. Tsang, J.W.T. Lee and
approach, European Journal of Operational Research, X.Z. Wang, On the generalization of fuzzy rough sets,
vol. 168, pp. 164180, 2006. IEEE Transactions on Fuzzy Systems, vol. 13, pp. 343
[25] J.S. Mi, W.Z. Wu and W.X. Zhang, Approaches to 361, 2005.
knowledge reduction based on variable precision rough [47] L.A. Zadeh, Fuzzy set, Information and Control, vol. 8,
set model, Information Sciences, vol. 159, pp. 255272, no. 3, pp. 338353, 1965.
2004. [48] W.X. Zhang, J.S. Mi and W.Z. Wu, Approaches to knowl-
[26] S. Nanda and S. Majumdar, Fuzzy rough sets, Fuzzy Sets edge reductions in inconsistent systems, International
and Systems, vol. 45, pp. 157160, 1992. Journal of Intelligent Systems, vol. 18, pp. 9891000,
[27] Z. Pawlak, Rough setstheoretical aspects of reasoning 2003.
about data, Kluwer Academic Publishers, 1991. [49] Y. Zhao, Y.Y. Yao and F. Luo, Data analysis based on
[28] Z. Pawlak, Rough set theory and its applications to data discernibility and indiscernibility, Information Sciences,
analysis, Cybernetics and Systems, vol. 29, pp. 661688, vol. 177, pp. 49594976, 2007.
1998. Minlun Yan received her BS degrees in mathematics from the
[29] Z. Pawlak, Rough sets and intelligent data analysis, Soochow University, Suzhou, in 1987. Now, he is an associate
Information Sciences, vol. 147, pp. 112, 2002. professor at the Lianyungang Teachers college. Her research
[30] Z. Pawlak, Some remarks on conflict analysis, European interests include pattern recognition and rough set.
Journal of Operational Research, vol. 166, pp. 649654,
2005.
[31] Z. Pawlak and A. Skowron, Rudiments of rough sets,
Information Sciences, vol. 177, pp. 327, 2007.

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1385

A Novel PIM System and its Effective Storage


Compression Scheme
Liang Huai Yang, Jian Zhou, Jiacheng Wang
School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310014, China

Email: yang.lianghuai@gmail.com

Mong Li Lee
School of Computing, National University of Singapore, Singapore
Email: leeml@comp.nus.edu.sg

AbstractThe increasingly large amount of personal The goal of PIM[6,7] is to offer easy access and
information poses a critical problem to users. Traditional manipulation of all of the information on a person's
file organization in hierarchical directories is not suited to desktop, with possible extension to mobile devices,
the effective management of personal information. In order personal information on the Web, or even all the
to overcome the shortcomings of the current hierarchical
file system and efficiently organize and maintain personal
information accessed during a person's lifetime. Personal
information, some new tools are expected to be invented. In information has a great diversity, which ranges from
this paper, we propose a novel scheme called concept space - office documents, PDF documents, emails, XML data,
a network of concepts and their associations and use topic relational data, music files, images, to videos etc. Besides
map as the underlying data model. We present a heterogeneity, the data is distributed in laptops, desktop
materialized view scheme to provide users with a flexible PCs, mobile phones, local systems, email servers and
view of the file system according to their own cognition. We other network systems, leading to information
also reduce the storage requirement to save space usage of fragmentation[2]. The increasingly large amount of
this system by borrowing some ideas from XML data personal information poses a critical problem to users.
management and contriving a novel and efficient data
compression scheme. To demonstrate the effectiveness of the
Thus, how to integrate these data is one of the challenges
above idea, we have implemented a prototype personal of PIM.
information management system called NovaPIM and Traditional file organization in hierarchical directories
presented its system architecture. Extensive experiments may not be suited to the effective management of
show that our proposed scheme is both efficient and personal information because it ignores the semantic
effective. associations and bears no connection with the
applications that users will run. Further, the physical
Index TermsPersonal Information Management, concept hierarchical directories can give user only one view.
space, data compression In this paper, we introduce the notion of concept space
to manage the collection of personal information objects.
Similar to the view of relational database, the term
I. INTRODUCTION
concept represents users logical view of the personal
Personal information management (PIM) refers to the information items which may locate in different file
activities people performed to acquire, store, organize and directories in the file system. We utilize the graphical
retrieve their items of digital information for everyday data model to organize the concept space where the nodes
use [1]. PIM gained intensive attention in recent years [11, are the concepts and the edges are the shortcuts to the
12,13,14,15,16]. Academic research on personal specific files or hyperlinks to some specific html
information tools stems from the early days of Hypertext documents. Consequently, there will be large number of
research including Vannevar Bush's vision of a PIM such links in this system. These links share many prefixes
device called memex [10] more than six decades ago, and gives us the opportunity to compress the contents.
a memex is a device in which an individual stores all his Based on these ideas, we design a PIM system called
books, records, and communications, and which is NovaPIM that provides flexible views of the information
mechanized so that it may be consulted with exceeding items and overcomes the weakness of current file system
speed and flexibility. It is an enlarged intimate which has only one monolithic physical organization.
supplement to his memory. However, most systems only NovaPIM also use a dictionary based compression
provide a fixed, rigid hierarchical file organization. To scheme to reduce the overheads of the storage space for
make matters worse, there is no alternate way, for the contents of the graphical data model. Experiments
example, using different views, to access the personal results verify the effectiveness of this scheme.
information. The remainder of the paper is organized as follows.
Section II reviews briefly the related work on PIM;
Manuscript received OCT 10, 2011; revised NOV 2, 2011; accepted Section III describes the architecture of our system
NOV 4, 2011. NovaPIM; Section IV addresses the issue of storage

Correspondent author

2012 ACADEMY PUBLISHER


doi:10.4304/jsw.7.6.1385-1392
1386 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

compression on NovaPIM; and experimental evaluations Keyword based information retrieval is not sufficient for
for our proposed scheme are given in Section V. Finally, PIM, hence a flexible querying scheme is desired [26].
we summarize our work in Section VI.

II. RELATED WORK


PIM has attracted much attention for decades since
Vannevar Bushs memex[10]. The increasingly large
amount of personal information(emails, sms, documents,
photos, videos, etc.), available from PCs, mobile phones,
PDAs, digital cameras, internet etc., poses a critical
problem to users. How to manage and organize this
information for personal productivity? As such, much
research has focused on this issue in recent years[11,12,
13,14,15,16]. PIM workshops sponsored by NSF (USA)
have been held for several years since 2005. Whittaker[28]
reviews research on three different information curation Figure 1. Architecture model of iMeMex PSDMS
processes: keeping, management and exploitation. A
series of research prototypes have been proposed in the iMeMex[14,13] uses RDF as its knowledge
academic community: SIS[20], Lifestreams[21], representation model. RDF is more "low-level" than the
Agenda[23], gIBIS[24], Rufus[25], iMeMex [14,13], topic maps[19]. In RDF, resources are represented as
SEMEX[6], Haystack[12], MyLifeBits[11], etc. Among triplets (subject, predicate, object). In topic maps, topics
them, SIS and Lifestreams are document oriented retrieval have characteristics of various kinds: names, occurrences
system, while the early tools like Agenda, gIBIS and and roles played in associations with other topics. The
Rufus, and the recent ones like SEMEX, Haystack and essential semantic distinction between these different
MyLifeBits, are based on relational model, and all data are kinds of characteristic is absent in RDF. And more often
uniformly represented in this data model. This approach than not, schema is absent from PIM. Consequently, the
can take advantage of the mature technology of RDBMS. data model of topic maps is exploited as our system's
As RDBMS depends on rigid relational schema, this underlying model.
approach cannot fully meet the needs of PIM. iMeMex,
on the other hand, presents an iDM data model, which III. AN OVERVIEW OF OUR PIM SYSTEM NOVAPIM
characterize itself with a graph data model to express the
data space, provides a formal method to represent a A. The prototype PIM system--NovaPIM
unified view of resources(such as document, directory,
To overcome the shortcomings of the current
relational table, XML document, data stream etc.). The
hierarchical file system in managing personal information,
theoretical foundation of iMeMex is RDF (Resource
we propose to use concept to organize and manage the
Description Framework), iMeMex's architecture consists
collection of personal information objects. A concept is a
of three layer: application layer, PDSMS (Personal Data
logical view that a user uses to organize the information
Space Management System) layer, and resource layer, as
items (files, URLs, emails) and includes one or more
shown in Figure 1. Stephen[22] gives a good description
sub-concepts or topics. For each sub-concept or topic, it
of PIM issues from the perspective of personal
can be materialized to a file which may contains one or
knowledge database. He addresses such issues as data
more shortcuts or hyperlinks to the physical files. These
model of personal knowledge database, the theoretical
concepts form an information space that we call it
problems involved, and taxonomy of PIM tools.
Concept Space, which is a network of concepts and their
There exist many PIMS tools in industry but still far
associations. Based on this idea, we have implemented a
from satisfactory. Here we enumerate some popular ones:
prototype PIM system called NovaPIM shown in Figure 2.
Microsoft's OneNote, Micro Logic's Info Select 1 and
Concepts are visualized as a tree-structure shown on the
Thomson's EndNote2. OneNote makes note taking easier,
left-hand tab folders while their relationships are
but it is an independent application separated from other
represented as folder and sub-folder or shortcuts/
applications as email client, Internet explorer, and file
hyperlinks in a file which can be edited/rendered on the
system. Info Select integrates email and note taking
right-hand tab folder. Figure 2 shows an example concept
functionalities into it, it is a good supplement to file
of Path Compression.
system[15]. EndNote focuses on the management and
Next, well discuss the system architecture and its
organization of literature without considering other
rationales behind it.
aspects of PIM.
As PIM involves a diversity of data types with their
implicit semantics and lack of associations between data B. System Architecture
objects, traditional desktop systems are incapable of PIM. Figure 3 illustrates the architecture of our system
NovaPIM. It consists of three layers: Application Layer,
1
http://www.miclog.com/software/ Concept Space Layer and Resource Layer.
2
http://www.endnote.com/ 1) Application Layer

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1387

Application Layer is the top layer of the system. It semantic network and take a more practical approach. For
provides various functionalities related to personal instance, we are using synonyms or even generic
information management, such as query and search, email related-to to express relationship between concepts. In
service, task management, agenda scheduling, etc.. The the future, we may introduce other associations into our
functions may be composed from existing independent system. The reason underlying this preference is detailed
applications through application integration, e.g., in place below.
activation of Microsoft office application via OLE
techniques. Some have to be constructed from scratch.
For example, the capability of flexible querying/search
for PIM system is desired, here we envision that the
system provides not only the traditional keyword based
searching capability but also the structure querying
capability, even the DB&IR capability[8]. Much of the
user interaction with PIMS involves exploring the data,
and users do not have a single schema to which they can
pose queries. Consequently, it is important that queries
are allowed to specify varying degrees of structure,
(a)editing a topic with drag/drop support
spanning keyword queries to more structure-aware
queries. The query system should be able to exploit both
exact matching and approximate matching scheme.
2) Concept Space Layer
As stated above, we use concept to organize / associate
PIM items. Concepts are assumed to be basic constituents
of thought and belief, and the basic units of thought and
knowledge that underlie human intelligence and (b)application embeddings
communication [17]. Every concept consists of the
Figure 2. Our prototype PIM systemNovaPIM
intension and the extension. The intension of a concept
consists of all properties or attributes that are valid for all
those objects to which the concept applies. It is an
abstract description of common features or properties
shared by elements in the extension. The extension of a
concept is the set of objects or entities which are
instances of the concept, or rather, the extension consists
of concrete examples of the concept. All objects/entities
in the extension have the same properties or attributes
that characterize the concept. A concept is thus described
jointly by its intension and extension. All the concepts
and their associations form the Concept Space Layer
which is the core of the PIM system. In essence, it is a
graph-data model.
The concept of concept space was first proposed from
Figure 3. The Architecture of our system NovaPIM
information retrieval perspective by Deng[3] in 1983,
where he stated that Concept Space was composed of the
concepts and the semantic network. Concept reflects the
objective nature of things and the characteristics of the
general. As we know, semantic network is a knowledge
representation scheme involving nodes and links (arcs or
arrows) between nodes, where the nodes represent objects
or concepts and the links represent relations between
nodes. The links are directed and labeled; thus, a
semantic network is a directed graph. In theory, this
definition is consistent with ours though there is no
standard definition of Concept Space by now. Figure 4. The network of concepts.
As stated before, the data model of Topic Maps is
exploited as our system's underlying model. A type During using and maintaining the personal information
hierarchy is helpful to enforce is-a relationship. items, different users may have different view of the
However, we have no pre-defined schema (or concept same items. In addition, the user's viewpoint of things
hierarchy) in our system. Instead, we use the extensions may evolve as his cognition advances or time goes by.
of concepts. In this respect, it's truly different from the Users may classify/categorize the files according to
traditional DBMS. In addition, we relax the use of different considerations (e.g. file content, file type,

2012 ACADEMY PUBLISHER


1388 JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012

subject, and time modified/downloaded) in different usage for its lack of schema. As a result, we resort to IR.
situations. Hence, it is impractical to force users to use a The combination of database technology and information
predefined semantic network in managing their retrieval may be the best rescue.
information collection. Consequently, NovaPIM takes a The extent of a concept is defined and edited via a
more practical stand, which allows users to freely define HTML editor and the hyperlinks therein links to other
their own concepts by using their own terms. Here, each concepts or physical resources. In the future, we'll
concept refers to a collection of resources that have introduce concept/view definition language to define
similar or related information. NovaPIM realizes the concepts. Another consideration is to improve on-demand
associations between concepts through shortcuts/ displaying of contents (concept/view) by active XML[27].
hyperlinks or folders/subfolders. These concepts form a NovaPIM combines the tree and flat file to present its
network as shown in Figure 4. As a result, such a scheme hierarchical structure and graph structure. Concepts are
allows users to have flexible views of the same stored in XML file and the relationship is realized via ID
information items, which is reminiscent of the view reference and hyperlinks that will be stated below. In
scheme in RDBMS. As such, this scheme is so powerful application layer, NovaPIM implements the embedding of
that it relieves people from the restrictions of one several applications such as PDF, HTML, Office, media,
physical organization allowed in the file system. With email, etc., and provides a flexible query scheme which
this scheme, users can build up a network of concepts incorporates concept hierarchy, file directory and
with different needs (e.g. the need of work, personal document content(refer to Figure 2 (b)). NovaPIM also
preference or habit). In addition, inductive inference and realizes email/task association. When an email arrives,
learning can be exploited to derive relationships between the system will check the contents of the mail and
the intensions of concepts based on the relations between compute its similarity with those tasks/topics/subjects
the extensions of concepts. Through the connections already defined.
between extensions of concepts, one may establish NovaPIM overcomes the weakness of the current file
relationships between concepts[18]. system. It provides the physical data independence
As a result, the Concept Space Layer uses concepts to through the mapping between concept space layer and
represent the data/file objects of various types/formats physical resource layer. NovaPIM provides a view
and interconnects them. This abstraction of resources scheme for user to create his own concept hierarchy
facilitates the user and the Application Layer. Concept according to his cognitions. NovaPIM has a drag & drop
Space Layer acts as the mediator between the Application scheme to define the extent of a concept without
Layer and the physical Resource Layer and is the core of changing the file directory structure physically. One only
PIMS. It achieves the physical data independence through needs to drag and drop the files into a HTML editor(refer
the mapping of Concept Space Layer/Resource Layer. to Figure 2 (a)); the shortcuts will be embedded into the
3) Resource Layer editor, e.g., file:///C:/publication/PIMS/AsWeMay
The personal information may be in various forms. It Think.pdf. The extent of a concept is similar in some
can be a document (in formats as word, txt, pdf, mp3, sense to the materialized view in DBMS.
rmvb, wav, etc.), email, URL, etc. or it can be structured By this way, some associations of concepts are
data in DBMS. It is noteworthy to mention that materialized as shortcuts/hyperlinks to the specific file
unstructured data comprises the vast majority of data indicated by a specific file directory path/URL/URI
found in an organization, some estimates run as high as (hereafter, we call this a locator, hyperlink or path for
80%[9]. In personal information items, this number simplicity). A locator is a string which is composed of a
becomes even larger. In PIM system, the management of series of label names separated by path separator / and
unstructured text is of primary importance. The items ends with a file name. With this, we can uniquely
may scatter in the different locations/directories in the file determine the file location in the file directory
system. structure/internet, and achieve the mapping from the
logical model (Concept Space) to the physical model (file
C. NovaPIM Implementation and Discussion system). The same item/file can be referenced at any
We have implemented a prototype system called number of times with no need of duplicating the
NovaPIM by Eclipse Java to demonstrate the proposed document. As the number of concepts increases, there
idea. A hyper graph model is adopted for our proposed will be large number of such links in this system. These
concept space. Each vertex in the graph is a concept (it is links probably share many prefixes and thus gives us
the extent of the concept in our case) and each edge is the opportunity to compress the contents. The issue of
association between concepts. Although the underlying content compression is addressed in next section.
theoretical foundation is topic map, it is relaxed in our
implementation. When a user uses unstructured data, the IV. COMPRESSION SCHEME FOR NOVAPIM
data is usually lack of schema or a user may not provide
the metadata. That's a big difference from traditional In NovaPIM, the extent of a concept is a collection of
database system which always has a set of predefined hyperlinks/locators. Taking all the shortcuts and
schema. As such, NovaPIM takes a more practical hyperlinks as whole, they form a tree. By borrowing the
approach. Currently it is a big challenge to define a view labeling scheme from XML data that are widely used in
for a concept using the topic map query language, and the XML query processing, we achieve the goal of data
topic map query language in NovaPIM has a very limited compression for NovaPIM. In this section, we adopt the

2012 ACADEMY PUBLISHER


JOURNAL OF SOFTWARE, VOL. 7, NO. 6, JUNE 2012 1389

ORDPATH[4,5] labeling scheme to form a dictionary- the existing children, its label is generated by adding -2 to
based compression method to reduce the overheads of the last ordinal of the first child. ORDPATH supports
storage space for NovaPIM. insertion and update efficiently without relabeling any
existing label; it is also efficient to determine the parent-
child relationship. In our scenario, we use ORDPATH
value instead of file directory path. For instance, in Figure
5, the value 1.1.1.1 (7 characters) represents the path
C:\Music\Hero.mp3(17 characters) which reduces the
storage space by 10 characters. Though, the length of the
ORDPATH label will become long in case of deep trees
and trees with large fan-out. Overall, the length of
ORDPATH value is greatly shorter than that of file
directory path.

B. Storage structure
Figure 5. An example of ORDPATH encoding
For better managing the path tree, NovaPIM requires
an efficient storage structure to maintain the compression
dictionary. Two alternative ways exist. The first approach
is to adopt the adapted ORDPATH Value Table.
ORDPATH scheme uses a table with its schema as
R(ORDPATH, TAG, NODE TYPE, VALUE), where
ORDPATH is the encoding value of a file directory path
by using ORDPATH labeling scheme, TAG represents a
node label of a locator( from the file directory tree or
internet), the other two fields, i.e., NODE TYPE and
Figure 6. A Variant of ORDPATH Encoding VALUE, are not used in NovaPIM and can be omitted
therefore. In addition, we need another field called
TABLE 1. VALUE TABLE RefCount. In NovaPIM, a concept may contain/reference
ORDPATH Encoding TAG RefCount many items from various resources. An item can be
1 / 9 referred to by different concepts at the same time.
1.1 C: 4 RefCount indicates how many times a file or a document
1.3 D: 4 is referenced. Its value is maintained dynamically. If it
1.1.1 Music 2 reaches zero, this entry can be removed from table to save
1.3.1 Study 1
1.1.1.1 Hero.mp3 1 space. When adding a path into the Concept Space, e.g.,
1.1.1.3 Fearless.mp3 5 C:\Music\Hero.mp3, the value of RefCount of each
1.1.1.5 Belief.mp3 5 corresponding node (C:, Music, Hero.mp3) will be
1.3.1.1 XML 3 added by 1; When deleting a path, the value of RefCount
1.3.1.3 E-books 1 of each corresponding node will be reduced by 1. Hence,
1.3.1.1.1 XML.pdf 1
the new table schema becomes R' (ORDPATH, TAG,
1.3.1.1.3 DTD.doc 1
RefCount), one such example table is shown in TABLE 1
1.3.1.1.5 XML.ppt 2
for Figure 5. The Value table is maintained dynamically.
1.3.1.3.1 Java.pdf 1
When a new shortcut or hyperlink needs to be encoded,
1.3.1.3.3 C++.pdf 1
we firstly look up it in this dictionary. If it exists, we
replace it with its corresponding ORDPATH code. If not,
B. Storage structure

For better management of the path tree, NovaPIM requires an efficient storage structure to maintain the compression dictionary. Two alternatives exist. The first approach is to adopt an adapted ORDPATH Value Table. The ORDPATH scheme uses a table with schema R(ORDPATH, TAG, NODE TYPE, VALUE), where ORDPATH is the encoding value of a file directory path obtained with the ORDPATH labeling scheme and TAG is the node label of a locator (from the file directory tree or the Internet); the other two fields, NODE TYPE and VALUE, are not used in NovaPIM and can therefore be omitted. In addition, we need another field called RefCount. In NovaPIM, a concept may contain or reference many items from various resources, and an item can be referred to by different concepts at the same time. RefCount indicates how many times a file or a document is referenced. Its value is maintained dynamically; if it reaches zero, the entry can be removed from the table to save space. When adding a path into the Concept Space, e.g., C:\Music\Hero.mp3, the RefCount of each corresponding node (C:, Music, Hero.mp3) is increased by 1; when deleting a path, the RefCount of each corresponding node is decreased by 1. Hence, the new table schema becomes R'(ORDPATH, TAG, RefCount); an example table for Figure 5 is shown in TABLE 1. The Value Table is maintained dynamically: when a new shortcut or hyperlink needs to be encoded, we first look it up in this dictionary; if it exists, we replace it with its corresponding ORDPATH code, and if not, we use the ORDPATH labeling scheme to add it to the dictionary. However, the add, delete, move and decode operations each traverse a path and must visit several tuples in the table, which leads to low efficiency.

The second approach is to use the encoding tree itself as the dictionary, with some adaptation: each node keeps only its encoding with the prefix removed, plus an added RefCount attribute. The encoding of the example in Figure 5 under this approach is shown in Figure 6. The RefCount of each node is shown in brackets next to its encoding and indicates the total number of occurrences of the node in different paths. As with the ORDPATH Value Table, when adding/deleting a path to/from the Concept Space, the RefCount of each corresponding node is increased/decreased by 1.

The operation of moving a subtree can be seen as the combination of an addition after a deletion. Node insertion does not incur the re-encoding of other nodes, and node deletion does not affect the ancestor/descendant, parent/child, or sibling relationships. When encoding, we obtain the path/locator encoding by traversing the dictionary tree and concatenating each node's encoding corresponding to the path/locator labels; when decoding, we combine each node's label by traversing the dictionary tree according to the encodings. This approach reduces the space overhead of the compression dictionary and is very efficient for encoding/decoding. When the number of hyperlinks is not too large, this approach is a good choice, and it is the approach taken in this paper.

Next, we describe the data compression algorithms below.

Algorithm: InsertHyperLink
Input: P - a hyperlink to be inserted; DT - the dictionary tree
Output: the corresponding ORDPATH encoding of hyperlink P, and the modified dictionary
1. Parse P into tokens (label names) {P1, P2, ..., Pn} according to the separator (\, / or others);
2. Let the current node Ncurr be the root node of DT;
3. For each label TP in {P1, P2, ..., Pn}, check whether TP is among the labels of Ncurr.ChildNodes():
   (1) if it exists: increase RefCount by 1 for the corresponding child node with TP as its label;
   (2) if it does not exist: create a new right-most child node with TP as its label for Ncurr, encode this new node and set its RefCount to 1;
   (3) Ncurr <- the child node with TP as its label;
4. Combine the ORDPATH encoding values of the nodes corresponding to P's labels, separated by ".", and return the result.

Figure 7. Adding a new HyperLink

Algorithm: DeleteHyperLink
Input: P - a hyperlink to be removed; DT - the dictionary tree
Output: the modified dictionary
1. Parse P into tokens (label names) {P1, P2, ..., Pn} according to the separator (\, / or others);
2. Let the current node Ncurr be the root node of DT;
3. For each label TP in {P1, P2, ..., Pn}, check whether TP is among the labels of Ncurr.ChildNodes():
   (1) if it exists: decrease RefCount by 1 for the corresponding node with TP as its label. If RefCount becomes 0, delete this node and return; else Ncurr <- the child node with TP as its label;
   (2) if it does not exist: report an error and return.

Figure 8. Deleting a HyperLink

C. Data Compression Algorithm

1) Shortcut/hyperlink encoding: When adding a new item into the Concept Space, the shortcut or hyperlink information of the item may need to be added into the dictionary tree. Here we take a shortcut as the example. First, we separate the shortcut (e.g. C:\Music\Hero.mp3) into its individual parts (C:, Music, Hero.mp3) by the separator \. Next, we check whether the leading node (C:) exists; if it exists, we check whether the subsequent node (Music) exists among the leading node's children; if the leading node does not exist, we encode it into an ORDPATH value. The remaining node (i.e. Hero.mp3) is handled in the same way. Finally, for each node accessed, the value of RefCount is increased by 1, and the ORDPATH value of the last node is the encoding value of the shortcut. The encoding algorithm is depicted in Figure 7.

2) Decoding ORDPATH values into shortcuts/hyperlinks: For a compression system, it is essential to get back the original shortcut/hyperlink by decoding an ORDPATH value (say X = 1.1.1.1). By removing the rightmost component of X (always an odd ordinal) and then all rightmost even ordinal components [4], we get its parent (here 1.1.1) in the dictionary tree. This process continues up to the root. Eventually, we recover the original shortcut/hyperlink by connecting the node labels successively, adding the separator \ in between.

3) Deleting a shortcut/hyperlink: When deleting a shortcut/hyperlink from its corresponding concept, we reduce the value of RefCount by 1 for all the nodes of the shortcut/hyperlink, visiting those nodes from the ORDPATH value in the way explained in 2) above. If the value of RefCount becomes 0, the corresponding entry is deleted. The deletion algorithm is shown in Figure 8.
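The following Python sketch (not the authors' implementation; the DictNode class and helper names are ours) mirrors the logic of the InsertHyperLink and DeleteHyperLink algorithms on a small in-memory dictionary tree. Decoding back to the original path would simply walk the stored labels in the reverse direction.

    # Illustrative sketch of Figures 7 and 8 on a dictionary tree.
    class DictNode:
        def __init__(self, label=(1,)):
            self.label = label        # ORDPATH ordinals as a tuple
            self.children = {}        # tag -> DictNode
            self.refcount = 0

        def child(self, tag):
            if tag not in self.children:
                # new right-most child: last ordinal of last child + 2, or 1 if none
                ordinal = max((c.label[-1] for c in self.children.values()), default=-1) + 2
                self.children[tag] = DictNode(self.label + (ordinal,))
            return self.children[tag]

    def insert_hyperlink(root, path, sep="\\"):
        """Encode a shortcut/hyperlink, bumping RefCount along the way (Figure 7)."""
        node = root
        for tag in path.split(sep):
            node = node.child(tag)
            node.refcount += 1
        return ".".join(map(str, node.label))

    def delete_hyperlink(root, path, sep="\\"):
        """Decrease RefCount along the path and prune leaf entries that reach zero
        (a conservative variant of Figure 8)."""
        node, stack = root, []
        for tag in path.split(sep):
            if tag not in node.children:
                raise KeyError(tag)   # report error, as in Figure 8
            stack.append((node, tag))
            node = node.children[tag]
            node.refcount -= 1
        for parent, tag in reversed(stack):
            entry = parent.children[tag]
            if entry.refcount == 0 and not entry.children:
                del parent.children[tag]

    root = DictNode()
    print(insert_hyperlink(root, r"C:\Music\Hero.mp3"))   # "1.1.1.1" on an empty dictionary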
TABLE 2. EXPERIMENT DATASET 1

Paths   Avg Depth   Avg Length (B)   Nodes    Avg Fanout
500     4.8         33.5             329      46.86
1000    5.3         51.2             1011     21.04
2000    5.3         48.7             1955     11.70
3000    5.2         47.1             2669     10.34
4000    5.1         46.6             3999     7.56
5000    4.6         43.3             5006     8.56
7000    5.3         45.7             6307     15.20
11000   4.9         45.6             10036    9.67

TABLE 3. EXPERIMENT DATASET 2

Paths    Avg Depth   Avg Length (B)   Nodes      Avg Fanout
5x10^4   8.7         76.2             50085      12.54
10^5     8.8         75.9             100152     11.95
10^6     7.2         61.5             1000018    14.51

V. PERFORMANCE EVALUATION

A. Experimental Environment and Data Generation

The experiments were performed on an Intel Core 2 Duo 2.2 GHz CPU with 2 GB memory, running Windows 7. We produced two data sets from real-life file directories, shown in TABLE 2 and TABLE 3. The first column indicates the total number of distinct shortcuts/hyperlinks; columns two, three and five give the average depth, length and fanout of the collections, respectively; and the total number of nodes is given in the fourth column. Note that the fanout is the fanout of the dictionary tree. In TABLE 2, the average depth of the data set is about 5 and the average path length is about 45.7 bytes. The data set in TABLE 3 has an average path depth of about 8 and an average path length of about 71.2 bytes.

B. Experimental results

All results are shown in tables. The acronyms used in the table headers are explained below.

Paths is the total number of distinct paths/locators; RC is the reference count RefCount, which indicates the number of times a resource is referred to by different concepts; ARC is the average reference count; DS is the space occupied by the dictionary tree, in KB; ES is the space consumed by the encodings, in KB; CS is the space consumed after compressing the data set (hyperlinks), in KB, and consists of two parts, the size of the encoded data plus the size of the dictionary tree, so that CS = DS + ES; OS is the space used without compression, in KB. By comparing the compressed space CS with the original data size OS, we obtain the compression ratio CR (%), i.e., CR = (OS - CS) / OS * 100.
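As a concrete reading of this formula (the numbers are taken from the 1000-path row of TABLE 4): the dictionary tree occupies DS = 29.42 KB and the encoded data ES = 13.41 KB, so the compressed size is CS = 29.42 + 13.41 = 42.83 KB against an uncompressed size of OS = 50.00 KB, giving CR = (50.00 - 42.83) / 50.00 * 100, which is approximately 14%, the value reported in the table.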
TABLE 4 to TABLE 9 illustrate the effectiveness of our compression scheme on storage space. The experimental results where each hyperlink is referred to only once by a concept are shown in TABLE 4 and TABLE 5; in this case the shared prefixes contribute to the effect. From the results we see that the average compression ratio is about 14% and 25%, respectively.

TABLE 6 to TABLE 9 give the results when a hyperlink is referred to more than once by several concepts. In TABLE 6 and TABLE 7 the reference count conforms to the normal distribution N(0, 3^2) and the average reference count is near 2; in TABLE 8 and TABLE 9 the reference count conforms to the normal distribution N(0, 5^2), with an average reference count around 3.5.

The experimental results on data set 1 are shown in TABLE 4, TABLE 6 and TABLE 8. With the increase of the average reference count, the average compression ratio grows accordingly from 14% and 39% to 52%. The experimental results on data set 2 are shown in TABLE 5, TABLE 7 and TABLE 9; their average compression ratio grows accordingly from 25% and 44% to 53% as the average reference count increases.

TABLE 4. EXPERIMENTAL RESULTS ON DATASET 1 WITH RefCount SET TO 1

Paths   RC   DS       ES       CS       OS       CR
500     1    7        4        11       16.37    33
1000    1    29.42    13.41    42.83    50.00    14
2000    1    51.84    28.99    80.82    95.19    15
3000    1    68.08    39.58    107.66   137.84   22
4000    1    100.71   66.53    167.24   182.03   8
5000    1    133.40   79.59    212.99   211.25   -0.8
7000    1    168.14   96.63    264.77   312.27   15
11000   1    262.31   182.39   444.70   490.03   9

TABLE 5. EXPERIMENTAL RESULTS ON DATASET 2 WITH RefCount SET TO 1

Paths    RC   DS      ES      CS      OS      CR
5x10^4   1    1518    1222    2740    3723    26
10^5     1    3020    2504    5524    7414    25
10^6     1    21365   23975   45340   60048   24

TABLE 6. EXPERIMENTAL RESULTS ON DATASET 1 WITH RefCount CONFORMING TO THE NORMAL DISTRIBUTION N(0, 3^2)

Paths   ARC    DS    ES    CS    OS    CR
500     1.85   7     7     14    30    53
1000    1.85   29    25    54    92    41
2000    1.89   51    55    106   180   41
3000    1.91   68    75    143   265   46
4000    1.91   100   127   227   346   34
5000    1.92   133   152   285   403   29
7000    1.88   168   182   350   584   40
11000   1.90   262   348   610   937   35

TABLE 7. EXPERIMENTAL RESULTS ON DATASET 2 WITH RefCount CONFORMING TO THE NORMAL DISTRIBUTION N(0, 3^2)

Paths    ARC    DS      ES      CS      OS       CR
5x10^4   1.91   1518    2341    3859    7132     46
10^5     1.91   3020    4796    7816    14206    45
10^6     1.92   21365   45949   67314   115077   42

TABLE 8. EXPERIMENTAL RESULTS ON DATASET 1 WITH RefCount CONFORMING TO THE NORMAL DISTRIBUTION N(0, 5^2)

Paths   ARC    DS       ES    CS    OS     CR
500     3.41   7        13    20    55     63
1000    3.42   29       45    74    170    56
2000    3.47   51       101   153   330    54
3000    3.52   68       139   207   486    57
4000    3.52   100      233   333   636    48
5000    3.53   133      280   413   741    44
7000    3.47   168      335   503   1076   53
11000   3.49   262.31   637   899   1713   47

TABLE 9. EXPERIMENTAL RESULTS ON DATASET 2 WITH RefCount CONFORMING TO THE NORMAL DISTRIBUTION N(0, 5^2)

Paths    ARC    DS      ES       CS       OS        CR
5x10^4   3.50   1518    4281.7   5799.6   13042.7   55.53
10^5     3.50   3020    8775     11795    25988.3   54.62
10^6     3.51   21365   84019    105384   210434    49.92

VI. CONCLUSION AND DISCUSSION

This paper proposes to use the idea of a concept space to manage personal information and to exploit the topic map as the underlying data model. Based on this, the paper presents the prototype system NovaPIM. NovaPIM integrates many desktop applications through application embedding and gives a solution to the problem of physical data independence. A materialized view scheme is provided to view the file system from different perspectives according to the user's own cognition. Users can define their concepts through drag and drop without physically changing the directory structure. NovaPIM combines both the tree and the graph model to organize and manage the data collection. Given the diversity of personal data, no single data model is sufficient; the combination of several data models may be the only right way.

With the help of shortcuts/hyperlinks, we represent the concept space in a graphical model. We adopted the ORDPATH labeling scheme to reduce the storage overhead of file directory paths, and the experimental results show its effectiveness.

As pointed out before, the lack of schema is an intrinsic characteristic of PIMS. To address this issue, data mining and machine learning techniques should come into play to discover the schema within the data collection and to find the relationships between concepts. An appropriate view definition language is also desired for this schema-less environment. All of these are on our future research agenda.

ACKNOWLEDGMENT

This work is supported in part by Zhejiang Provincial NSF Project (Y1090096) and the NSFC Project (61070042).

REFERENCES

[1] M. Lansdale. The psychology of personal information management. Applied Ergonomics, 19(1), 1988, pp. 55-66.
[2] W. Jones. Finders, keepers? The present and future perfect in support of personal information management. First Monday, 2004, http://www.firstmonday.dk/issues/issue9_3/jones/index.html.
[3] L. H. Deng. Library and Information Mathematics, Northeast Normal University, 1983.
[4] P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-Friendly XML Node Labels. Proceedings of the ACM SIGMOD, 2004, pp. 903-908.
[5] R. Alkhatib and M. H. Scholl. Compacting XML Structures Using a Dynamic Labeling Scheme. BNCOD, 2009, pp. 158-170.
[6] X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. CIDR, 2005.
[7] S. T. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I've seen: A system for personal information retrieval and re-use. SIGIR, 2003, pp. 72-79.
[8] S. Chaudhuri, R. Ramakrishnan, G. Weikum. Integrating DB and IR Technologies: What is the Sound of One Hand Clapping? CIDR, 2005.
[9] C. C. Shilakes and J. Tylman. "Enterprise Information Portals", Merrill Lynch, 16 November 1998.
[10] V. Bush. As we may think. Atlantic Monthly, 176(1), 1945, pp. 101-108.
[11] J. Gemmell, G. Bell, R. Lueder, S. M. Drucker, C. Wong. MyLifeBits: Fulfilling the Memex vision. Proc. of the 10th ACM International Conference on Multimedia, 2002, pp. 235-238.
[12] D. R. Karger, K. Bakshi, D. Huynh, D. Quan, V. Sinha. Haystack: A customizable general-purpose information management tool for end users of semistructured data. CIDR, 2005, pp. 13-26.
[13] J.-P. Dittrich and M. A. V. Salles. iDM: A unified and versatile data model for personal dataspace management. VLDB, 2006, pp. 367-378.
[14] L. Blunschi, J. Dittrich, O. R. Girard, S. K. Karakashian, and M. A. V. Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System. CIDR, 2007, pp. 114-119.
[15] W. Jones, J. Teevan. Personal Information Management. Communications of the ACM, 49(1), 2006, pp. 40-42.
[16] D. K. Barreau. Context as a factor in personal information management systems. Journal of the American Society for Information Science, 46(5), 1995, pp. 327-339.
[17] Y. Y. Yao. Concept Formation and Learning: A Cognitive Informatics Perspective. Proceedings of the Third IEEE International Conference on Cognitive Informatics, 2004, pp. 42-51.
[18] Y. Y. Yao. A step towards the foundations of data mining. Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, B. V. Dasarathy (Ed.), The International Society for Optical Engineering, 2003, pp. 254-263.
[19] Topic Maps - XML Syntax. http://www.isotopic-maps.org/sam/sam-xtm/2006-06-19/
[20] S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, D. C. Robbins. Stuff I've seen: a system for personal information retrieval and re-use. SIGIR Conference, 2003, pp. 72-79.
[21] S. Fertig, E. Freeman, and D. Gelernter. Lifestreams: An alternative to the desktop metaphor. In Conference Companion on Human Factors in Computing Systems: Common Ground, 1996, pp. 410-411.
[22] S. Davies. Still Building the Memex. Communications of the ACM, 2011, 54(2):80-88.
[23] S. J. Kaplan, M. D. Kapor, E. J. Belove, R. A. Landsman, and T. R. Drake. Agenda: A personal information manager. Commun. ACM, 33(7), July 1990, pp. 105-116.
[24] J. Conklin and M. L. Begeman. gIBIS: A hypertext tool for exploratory policy discussion. ACM Transactions on Office Information Systems, Vol. 6, No. 4, October 1988, pp. 303-331.
[25] K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, J. Thomas. The Rufus System: Information Organization for Semi-Structured Data. In VLDB, 1993, pp. 97-107.
[26] W. Wang, A. Marian, T. D. Nguyen. Unified Structure and Content Search for Personal Information Management Systems. International Conference on Extending Database Technology, 2011, pp. 201-212.
[27] S. Abiteboul, O. Benjelloun, T. Milo. Positive Active XML. PODS Conference, 2004, pp. 35-45.
[28] S. Whittaker. Personal Information Management: from information consumption to curation. Annual Review of Information Science and Technology (ARIST), Vol. 45 (2011), pp. 3-62.

Liang Huai Yang is a professor at Zhejiang University of Technology. He received the BSc degree in Information Science (Department of Mathematics) in 1989 and the PhD degree in Computer Science from Peking University in 2001. He held a research fellow position at the National University of Singapore during 2001-2005. He has published about 40 papers in major conferences and journals in the database field. He has served on the program committees of several database conferences and as a reviewer for journals such as Information Sciences, Information Systems, International Journal of Electronics and Computers, etc.

Lee Mong Li is an Associate Professor and Assistant Dean in the School of Computing at the National University of Singapore (NUS). She received her Ph.D. in Computer Science from NUS in 1999. She was awarded the IEEE Singapore Information Technology Gold Medal for being the top student in the Computer Science program in 1989. Mong Li joined the Department of Computer Science, National University of Singapore, as a Senior Tutor in April 1989 and was appointed Fellow in the School of Computing in February 1999. She was a visiting Fellow at the Computer Science Department, University of Wisconsin-Madison, from September 1999 to August 2000 and a Consultant at Quiq Incorporated, USA, from June to August 2000. Her research interests include the cleaning and integration of heterogeneous and semi-structured data, database performance issues in dynamic environments, and medical informatics. Her work has been published in database conferences such as ACM SIGMOD, VLDB, ICDE and EDBT, the data mining conference ACM SIGKDD, and the database conceptual modeling conference (ER).


Analyzing Effective Features based on User Intention for Enhanced Map Search

Junki Matsuo, Daisuke Kitayama, Ryong Lee, Kazutoshi Sumiya
Graduate School of Human Science and Environment, University of Hyogo, Japan
National Institute of Information and Communications Technology (NICT), Japan
Email: nd11g028@stshse.u-hyogo.ac.jp, {dkitayama, sumiya}@shse.u-hyogo.ac.jp, lee.ryong@gmail.com

Abstract: Maps are perhaps the most critical information in daily real-world activities. Due to the advance of the Web and digital map processing techniques, we can now easily find various maps of different presentations appropriate to diverse user purposes such as trivial searching for a restaurant or consulting a path during a trip. However, the maps served by today's representative map search engines such as Google Maps cannot satisfy all users, whose map-reading ability and search purposes are quite different. Thus, map search engines need to provide maps well represented for specific needs. Nowadays, numerous map contents are available on the Web, appropriately drawn and shared on various web sites. However, it is not an easy task for users to find appropriate maps on the Web. In order to support users' map search on the Web, we developed a map search system which can search for map contents drawn from various viewpoints by interacting with users through relevance feedback. In particular, we analyze each map content according to two distinguishing kinds of features, geographical features and image features. Significantly, the proposed system can deal with visual map contents by considering how the map contents are represented. In this paper, we analyze effective features based on user intention for map search.

I. INTRODUCTION

Due to the advance of the Web and digital map processing techniques, we can now easily find various maps of different presentations appropriate to diverse user purposes such as trivial searching for a restaurant or consulting a path during a trip. Obviously, maps must be one of the most useful contents for daily outdoor activities. There are some online maps for general purposes such as Google Maps [1], Bing Maps [2], and so on. However, these systems cannot satisfy all users, whose map-reading ability and search purposes are quite different. Maps for general purposes may rather confuse users because of the excessive number of objects drawn on them. Regarding differences in map recognition, Kobayashi et al. [3] showed that there are individual differences in people's map-reading ability. Generally, poor map readers suffer from the abundance of information unnecessarily given on general maps. Therefore, it is an interesting challenge to search for maps appropriately represented and fitted to user purposes. Even a user with poor map-reading ability can understand route information more easily if maps specialized for route guidance are given, where street information is emphasized over other information. Thus, looking for maps that match user intention is critical for users to obtain appropriate location information. For this purpose, we propose a map search system to retrieve maps that match user intention, focusing on features of map contents. The search engine can interact with users through relevance feedback. Relevance feedback allows users to express a variety of requests; additionally, it makes it possible to search for better maps interactively. In particular, we introduce a map search engine which can show a ranking of maps based on user intention. In order to rank the maps, it is necessary to analyze map contents. In the map search engine, each map content is analyzed into two distinguishing kinds of features, geographical features and image features. Geographical features represent some deformation of the real world by controlling the map objects drawn, such as the number of objects, the scale ratio, etc. On the other hand, image features refer to visual effects, as in ordinary images, such as image size, the overall mean of color components, etc. Therefore, it is possible to search for maps that match user requirements when the users also understand the maps in terms of these features. For example, a user traveling by train may unconsciously focus on a feature like the number of stations on the map. Although maps have a variety of features, excessive features may have harmful effects on ranking. Therefore, we need to select effective features for map search based on users' implicit intention. We especially consider two types of purposes for user requirements: object confirming and path finding. Specifically, we define features for the analysis of maps and show the candidates of these features. Then, we extract effective features for ranking based on each purpose. To achieve our goal, we apply a support vector machine (SVM) [4] to analyze the features emphasized by users for each purpose; an SVM constructs a classification by dividing the training data into positive and negative classes with a hyperplane. Based on this method, we can effectively extract features that can classify unknown maps into usable and non-usable maps.

The remainder of this paper is as follows. Section 2 describes the concept of the map search engine for our approach. Section 3 reviews related work. Section 4 explains how to analyze effective features based on purposes by SVM.


Section 5 discusses the experiment for feature extraction based on users' participation. Finally, Section 6 concludes this paper with future work.

II. SEARCH ENGINE FOR MAPS

In general, maps describe part of the real world by representing real-world objects according to a variety of viewpoints. Such modified maps are usable if they are well represented for specific purposes. For instance, on route guidance maps, paths to a destination can be emphasized over other information. These maps help users get to a certain destination by providing route information. Thus, users can find appropriate regional information through maps that match their requirements.

Due to the growth of the Web, there are large numbers of maps on the Web. However, it is not an easy task to find suitable maps by using existing search engines. For example, an image search engine allows users to obtain many maps, but this retrieval method does not consider the essence of maps. In order to present maps appropriate to user purposes, it is necessary to analyze their contents in terms of cartography, whereas an image search engine uses limited elements like color components or surrounding texts without considering map features such as the number of objects, the scale ratio, etc. Therefore, it is difficult to search for maps reflecting user intention. In addition, it is hard for users to represent their requests using only keyword queries. Consequently, inappropriate and unrelated maps can appear in the search results. In order to resolve this problem, we developed a search system for the retrieval of maps.

A map search engine needs a function to show appropriate maps by retrieving them on the basis of user requests. However, there are some problems in retrieving maps that match user purposes. First, users may have a variety of different intentions. For example, users may want to confirm a path to their destination and request maps for route guidance; on the other hand, users could require sightseeing information and look for maps showing sightseeing spots. Map search engines have to interpret those requests. In addition, it is difficult for users to express these requests as detailed search queries. Hence, we assume a search engine that uses relevance feedback for representing user requests. This method allows users to select appropriate maps from the displayed maps to express their requests. Among other requirements, map search engines must consider how usable maps are for user purposes. We believe that the usability of maps depends on user purposes. Thus, we consider two types of purposes as user requests: object confirming and path finding. Hence, map search engines first need to recognize the user's purpose and then show maps usable for that purpose. In order to determine the usability of maps for each purpose, it is necessary to analyze the components of maps. Because a map is an image describing regional information, there are two distinguishing kinds of features constructing maps: geographical features and image features. Geographical features explain geometric information, while image features depict map images in terms of graphics. Map search engines determine appropriate maps for users on the basis of these features and show a ranking of maps.

Figure 1. Map search engine

Figure 1 shows the concept of our map search engine. First, the system presents candidate maps that may have the required information through the user interface. Second, the user can select some usable maps that match their purpose. User requests are interpreted on the basis of the selected maps by means of relevance feedback; the map search engine then takes the user request as a choice between object confirming and path finding. Third, the maps in the database are ranked on the basis of the two types of features corresponding to the user purpose. Finally, the system shows a ranking of maps. In addition, the user can select maps from the ranking repeatedly. It is possible to retrieve better maps interactively because the search query is improved whenever new maps are selected. In this system, we assumed a map database containing an adequate number of maps indexed with map features consisting of geographical features and image features. Hence, the map search engine can satisfactorily rank the maps in the database by using geographical features and image features.


III. RELATED WORK

A lot of studies have been conducted on maps. Honda et al. [5] proposed the automated generation of deformed maps by using road deformation and landmark relocation. Fujii et al. [6] proposed a route guide map generation system based on re-arranging detailed maps. These studies aimed to generate maps. However, maps generated without considering user requests are often uniform. Our method can present appropriate maps for each user by analyzing the features contained in maps and focusing on user intention.

Methods for analyzing the components of maps have been extensively researched. Agrawala et al. [7] analyzed the generalizations commonly found in hand-drawn route maps. Osaragi et al. [8] proposed the extraction of key map elements by analyzing roads and buildings that are represented in existing maps. Grabler et al. [9] proposed the generation of tourist maps based on image analysis and Web-based information. Although these studies analyzed maps, user requirements were not considered. We believe that maps are usable when the described information matches the user. Thus, our study requires considering map features on the basis of user intention.

A variety of methods to find maps satisfying various requirements have been investigated. Michelson et al. [10] proposed a method for classifying maps from images collected on the Web; their classifier is based on Water-Filling features, which are edge-based features. Chiang et al. [11] built a map classification technique based on a nearest-neighbor classifier using the luminance-boundary histogram, an image comparison feature. Newsam et al. [12] proposed content-based image retrieval against a target set of geographic images by using visual features. These studies use only image features for interpreting maps. However, it is difficult to consider user intention by using only image features, because user purposes are mainly expressed through geographical features. Hence, we also use geographical features for the ranking of maps.

In the information recommendation field, Oku et al. [13] proposed a recommendation method that considers user context through SVM. They used SVM to classify items as suitable or not, and their method recommends restaurant information by considering user preferences and contexts. This study is similar to ours because they focus on considering user requirements. However, we believe that user requirements for maps differ from those for restaurants.

IV. FEATURES EXTRACTION BASED ON PURPOSES

A. Features for ranking of maps

In this section, we describe the features used for the ranking of maps. On maps, real-world information is described using a variety of features. These features are classified into two categories, Geographical Features (GFs) and Image Features (IFs), since map images are usually drawn considering the placement of objects as well as visual effects. We define map features as follows:

    MapFeatures = {GF, IF}    (1)

Here, GF is the set of geographical features, where each feature explains regional information. On the other hand, IF is the set of image features, which depict map images in terms of graphics. Maps have a variety of features; however, some features may have harmful effects on ranking. Therefore, we extract the features that are effective for ranking according to the user's purpose.

1) Geographical Features (GFs): We first explain the geographical features used to construct maps. Geographical features are physical quantities in maps that act as elements explaining regional information. For example, the latitude and the longitude of geographical objects are elements that explain the location in relation to the real world in a physically quantifiable manner. However, geographical features differ from map to map. In particular, maps are specialized for showing certain regions or for specific purposes. Thus, we consider that each map has individualized features.

Table I shows the 29 geographical features used in this study. In this study, we assumed a database containing an adequate number of indexed maps. For the features related to the region shown on the maps, it is possible to obtain coordinates by using geocoding. To estimate the areas shown on the maps, we used the minimum bounding rectangle (MBR).

TABLE I. GEOGRAPHICAL FEATURES (29 DIMENSIONS)

region:
  the coordinates in the real world - the coordinates of (1) north, (2) south, (3) east and (4) west
  the area in the real world - MBR containing all objects
  scale ratio - the ratio between the MBR and the size of the image
appearance:
  objects - the number of all objects on the map
  landmark objects - (1) the number and (2) the ratio of landmark objects
  path objects - (1) the number and (2) the ratio of path objects
  edge objects - (1) the number and (2) the ratio of edge objects
  district objects - (1) the number and (2) the ratio of district objects
  node objects - (1) the number and (2) the ratio of node objects
distribution:
  dispersion in an image - (1) x-coordinate, (2) y-coordinate, and (3) total dispersion in an image
  dispersion in the real world - (1) lat-coordinate, (2) lon-coordinate, and (3) total dispersion in the real world
  the position of a certain object - (1) x-coordinate, (2) y-coordinate and (3) the distance from the center of a certain object in an image
other:
  cardinal direction - the angle between the upper direction and north
  the information for route guidance - existence of texts for route guidance other than geographical names
  other information - existence of texts other than geographical names and information for route guidance
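As a rough illustration of how the region-related entries of Table I could be computed, the following sketch (in Python; it is not the authors' implementation, and the object list and image size are hypothetical inputs) derives the bounding coordinates, MBR area, scale ratio and dispersions from geocoded object coordinates.

    # Illustrative sketch: a few of the geographical features of Table I,
    # computed from a list of (lat, lon) object coordinates and the image size.
    from statistics import pvariance

    def geo_features(objects, img_w, img_h):
        lats = [p[0] for p in objects]
        lons = [p[1] for p in objects]
        north, south = max(lats), min(lats)
        east, west = max(lons), min(lons)
        mbr_area = (north - south) * (east - west)     # MBR containing all objects
        scale_ratio = mbr_area / (img_w * img_h)       # ratio between MBR and image size (units as used here)
        return {
            "north": north, "south": south, "east": east, "west": west,
            "area": mbr_area,
            "scale_ratio": scale_ratio,
            "num_objects": len(objects),
            "dispersion_lat": pvariance(lats),
            "dispersion_lon": pvariance(lons),
            "dispersion_real_world": pvariance(lats) * pvariance(lons),
        }

    print(geo_features([(35.00, 135.78), (34.99, 135.79), (34.98, 135.77)], 640, 480))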


For the features related to the geographical objects shown on the maps, we separated all objects based on the five elements that make up the city [14]. Path objects denote elements such as streets and railways that people can pass through. Edge objects are linear elements like rivers. District objects are regions with internally homogeneous characteristics, to which we allocated the names of prefectures and cities. Node objects are meeting points; examples include stations and bus stops. The other objects are used as landmark objects. We used the number and the ratio of each element. For the features related to the distribution of the objects shown on the maps, we calculated the dispersions of coordinates both in the real world and in the images: we used the dispersion of the x- and y-coordinates as well as the product of these dispersions, and likewise the dispersion of the latitudes and longitudes as well as the product of those dispersions. In addition, we consider the coordinates of a certain object and its distance from the center. Among the other features, there is the information described on the maps other than geographical names: we used the appearance of text denoting route guidance or the distance between objects as the information features for route guidance. The other information, such as advertisements for shops and captions, denotes data other than route guidance.

2) Image Features (IFs): We next explain the image features used to construct maps, because maps have not only geographical features but also image features. Image features are physical quantities in maps that act as elements for depicting information as an image. For example, the brightness of a color component is an element for depicting the tone of the image. It is possible for the physical quantities of coloring or the sizes of images to affect the visibility or impression of maps. Hence, we estimate that these features affect the usability assessment of maps.

Table II shows the 21 image features used in this study. As features related to the shape of map images, we used the number of pixels in the row, in the column, and in total; these elements make up the size of the image. As features related to the colors of the map images, we used the overall mean of each pixel value and the mean of each pixel value around a certain object, according to the RGB and HSV color models. Additionally, we consider the difference between the overall mean and the mean around a certain object.

TABLE II. IMAGE FEATURES (21 DIMENSIONS)

shape:
  image size - the number of pixels in a (1) column, (2) row and (3) total
color:
  R - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of the red color components
  G - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of the green color components
  B - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of the blue color components
  hue - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of hue
  saturation - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of saturation
  brightness - (1) the overall mean, (2) the mean around a certain object, and (3) the difference of brightness

In this paper, we used 50 features consisting of geographical features and image features. We estimate that these features are especially effective for the ranking of maps. There are other map features, and our extraction method can also deal with them; therefore, a search engine that considers other features is one of our future investigations.
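A minimal sketch of the color-related entries of Table II is given below (Python with the Pillow library; this is illustrative only and not the authors' code, and a small synthetic image stands in for a real map). The means around a certain object and the corresponding differences would be computed the same way over a window of pixels around that object.

    # Illustrative sketch: overall mean RGB and HSV components and image size.
    import colorsys
    from PIL import Image

    def mean_color_features(img):
        pixels = list(img.convert("RGB").getdata())
        n = len(pixels)
        mean_rgb = tuple(sum(p[i] for p in pixels) / n for i in range(3))
        hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
        mean_hsv = tuple(sum(p[i] for p in hsv) / n for i in range(3))
        return {"mean_R": mean_rgb[0], "mean_G": mean_rgb[1], "mean_B": mean_rgb[2],
                "mean_hue": mean_hsv[0], "mean_saturation": mean_hsv[1],
                "mean_brightness": mean_hsv[2],
                "pixels_row": img.width, "pixels_column": img.height,
                "pixels_total": img.width * img.height}

    demo = Image.new("RGB", (64, 48), (200, 180, 120))   # stand-in for a map image
    print(mean_color_features(demo))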
B. SVM-based Effective Feature Extraction

We extract the features that affect each purpose by using an SVM (support vector machine). An SVM is a learning machine for two-class classification. In an SVM, the training data are located in a coordinate space as feature vectors, and unknown data can be classified into classes A and B by constructing discriminant surfaces that separate the training data. In our study, we constructed a discriminant model to classify maps as usable or non-usable for a given purpose through SVM. Figure 2 shows a conceptual diagram of the classification of maps using SVM.

Figure 2. Classification of maps using SVM

Maps preliminarily classified by the user are located in the N-dimensional coordinate space. In this figure, black circles denote usable maps and black squares denote non-usable maps. It is possible to classify unknown maps into usable and non-usable maps by constructing a discriminant surface based on maximizing the margin. In addition, white circles and white squares denote the support vectors used for constructing the discriminant surface.

Maps can be located in the coordinate space using a variety of features. We believe that locating them on the basis of the map features emphasized by the user yields high classification precision. Hence, we try to classify by using various combinations of map features and extract the features that can classify unknown maps into usable and non-usable maps well. These features should be effective for map search.

In our study, we use SVM for feature extraction because SVM has much better generalization capability.


Oku et al. [15] showed the validity of applying SVM to model user preferences and optimized feature parameters of user context and restaurants for information recommendation. In that work, they compared SVM with other methods such as Neural Network, k-Nearest Neighbor, Decision Tree, Bayesian Filtering, etc. As a result, SVM's generalization capability was superior to the other methods. Their features are similar to map features in that they are expressed as N-dimensional vectors. Hence, classification by using SVM is suitable for feature extraction.

In our study, it was necessary to construct discriminant surfaces that can classify maps. In order to find effective features for map search, we first propose many features that may affect the ranking of maps. Second, we conduct experiments to classify the maps as usable or non-usable based on these candidate features. Finally, we extract the features necessary for ranking the maps from among these candidates.
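The classification step described above can be sketched as follows. This is an illustrative example only, not the authors' code: scikit-learn's SVC is used here as a stand-in for the LIBSVM Python bindings mentioned later in the paper, and the feature vectors and usable/non-usable labels are synthetic.

    # Illustrative sketch: training a two-class SVM on maps judged
    # usable (1) / non-usable (0) for one purpose, then classifying unseen maps.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.random((60, 3))                    # e.g. 60 training maps, 3 selected features
    y_train = (X_train[:, 0] > 0.5).astype(int)      # toy usable/non-usable labels
    X_test = rng.random((40, 3))                     # e.g. 40 held-out maps

    clf = SVC()                                      # default ("basic") parameters
    clf.fit(X_train, y_train)
    print(clf.predict(X_test)[:10])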
judged to be usable by more than half of the participants
V. E XPERIMENT as usable. Figure 3 shows the results of the preliminary
We conducted an experiment to extract features empha- experiment. Most maps (86%) were classified as maps
sized by the users for measuring the usability of maps. usable for the purpose of object confirming, the purpose
The necessary features were estimated on the basis of of path finding, or both. Hence, we concluded that these
user purposes. Hence, the participants judged the usability two purposes were sufficient to reflect user requirements.
of maps on the basis of the following purposes in this In the experiment for features extraction participants
experiment: judged whether a map was usable or not for each purpose;
Object confirming we considered these responses the judgment data. We
We extracted features that affect usability when a extracted features emphasized by the users on the basis of
user wants to confirm objects. It is necessary for each purpose, by revealing relations between the judgment
users to confirm objects that exist at a certain po- data and the features contained in the maps.
sition. When users have not decided upon a specific
destination, they may choose the desired target ob- A. Experimental Procedure
jects. We assumed that a user planning to do some
sightseeing will require maps showing sightseeing First, we prepared 100 maps that described the Kyoto
spots such as Gion, Higashiyama, and Kiy- area; Kiyomizu Temple was described on all the maps. In
omizu Temple. We believe that maps that describe the case of this dataset, a certain object refers to Kiy-
these objects will be usable in this case. Maps that omizu Temple. Then, we prepared 50 maps that described
allow a user to understand the positional relations the San Francisco area; Union Square was described on
between objects are usable for object confirming. all the maps. In this case, a certain object refers to
Path finding
Union Square. The 20 participants that took part in this
We extracted features that affect usability when a experiment were all university students. After viewing all
user wants to find a certain path. It is important for the maps of the Kyoto area, we asked 10 participants
users to confirm the correct path to their destination. to classify the 100 maps as usable or non-usable as a
When users have already decided their destination, response to each of the following statements:
they may desire maps that provide route information. Select usable maps for understanding the positional

We assumed that a user planning to go to Kiyomizu relation between Kiyomizu Temple and the other
Temple by train or on foot will require the positions sightseeing spots.
of stations or the names of streets. Therefore, maps This situation corresponds to the purpose of object
that make it easy for the user to recognize the confirming.
position of transportation facilities and routes to Select usable maps for understanding how to go to

the destination will be usable in this case for path Kiyomizu Temple.
finding. This situation corresponds to the purpose of path
The two above-mentioned purposes are not exclusive. finding.
Hence, there also will be usable maps that can serve both On the other hand, after viewing all the maps of the
these purposes. San Francisco area, we asked the other 10 participants to
We conducted a preliminary experiment for verifying classify the 50 maps as usable or non-usable as a response
the validity of these purposes. First, we collected 200 to each of the following statements:
maps by using an image search engine. These maps have Select usable maps for understanding the positional
a wide variation because we retrieved them by using relation between Union Square and the other sight-
a variety of search queries such as sightseeing map, seeing spots.


… This situation corresponds to the purpose of object confirming.
• Select usable maps for understanding how to go to Union Square.
  This situation corresponds to the purpose of path finding.

Table III shows an example of the features and values used in the experiment. These values were normalized by the maximum value of each feature. The considered map had 50 features (29 geographical features and 21 image features). Further, labels of T or F were added to denote whether the map was usable or non-usable with respect to each purpose. The map considered in this example had a label of F for the purpose of object confirming and a label of T for the purpose of path finding.

TABLE III. AN EXAMPLE OF FEATURE DATA

     feature                                              value
1    the coordinate of northern edge                      0.53
2    the coordinate of southern edge                      0.57
3    the coordinate of eastern edge                       0.16
4    the coordinate of western edge                       0.53
5    the area in real world                               0.47
6    scale ratio                                          0.06
7    the number of objects                                0.74
8    the number of landmark objects                       0.86
9    the ratio of landmark objects                        0.57
10   the number of path objects                           0.53
11   the ratio of path objects                            0.13
12   the number of edge objects                           0.20
13   the ratio of edge objects                            0.04
14   the number of district objects                       0.00
15   the ratio of district objects                        0.00
16   the number of node objects                           0.73
17   the ratio of node objects                            0.34
18   dispersion of x-coordinates                          0.38
19   dispersion of y-coordinates                          0.57
20   dispersion in an image                               0.28
21   dispersion of latitudes                              0.03
22   dispersion of longitudes                             0.04
23   dispersion in real world                             0.00
24   x-coordinate of a certain object                     0.66
25   y-coordinate of a certain object                     0.68
26   the distance from center to a certain object         0.34
27   cardinal direction                                   0.26
28   the information for route guidance                   0.00
29   the other information                                0.00
30   the number of pixels in a row                        0.39
31   the number of pixels in a column                     0.41
32   the number of pixels                                 0.18
33   the overall mean of red color                        0.84
34   the mean of red color around a certain object        0.29
35   the difference of red color                          0.62
36   the overall mean of green color                      0.81
37   the mean of green color around a certain object      0.64
38   the difference of green color                        0.27
39   the overall mean of blue color                       0.61
40   the mean of blue color around a certain object       0.45
41   the difference of blue color                         0.17
42   the overall mean of hue                              0.40
43   the mean of hue around a certain object              0.67
44   the difference of hue                                0.37
45   the overall mean of saturation                       0.70
46   the mean of saturation around a certain object       0.31
47   the difference of saturation color                   0.39
48   the overall mean of brightness                       0.88
49   the mean of brightness around a certain object       0.29
50   the difference of brightness color                   0.64

In the experiment, we regarded maps judged to be usable by more than half of the participants as usable. For the classification of the maps of the Kyoto area, we used 60 maps as the training data and the other 40 maps as test data; the test maps were randomly selected from the 100 maps. For the classification of the maps of the San Francisco area, we used 30 maps as the training data and the other 20 maps as the test data. We constructed a classification model using the set of training data. To construct the classification model, we used LIBSVM with its basic parameters for the Python programming language [16]. We classified the experimental data by using all combinations of at most three features out of the 50 features and determined the features effective for ranking the maps. In the next section, we first introduce the features extracted on the basis of each purpose for the maps of the two considered areas and then discuss the features effective for each purpose.
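The combination search mentioned above can be sketched as follows (Python; illustrative only, not the authors' code; the data arrays, label vectors and classifier factory are assumed inputs). Enumerating all subsets of one, two or three of the 50 features yields C(50,1) + C(50,2) + C(50,3) = 50 + 1225 + 19600 = 20875 candidate sets, which matches the number of sets referred to in the next subsection; each set is then scored by how well a classifier trained on those columns predicts the usable/non-usable judgments.

    # Illustrative sketch: enumerate every combination of at most three features
    # and score each one by its usable/non-usable prediction accuracy.
    from itertools import combinations
    from math import comb

    n_features = 50
    feature_sets = [s for k in (1, 2, 3) for s in combinations(range(n_features), k)]
    assert len(feature_sets) == comb(50, 1) + comb(50, 2) + comb(50, 3) == 20875

    def precision_of(feature_set, X_train, y_train, X_test, y_test, make_classifier):
        """Train on the selected columns only and return the fraction of correct predictions."""
        clf = make_classifier()
        clf.fit(X_train[:, list(feature_set)], y_train)
        pred = clf.predict(X_test[:, list(feature_set)])
        return (pred == y_test).mean()

    # ranking = sorted(feature_sets,
    #                  key=lambda s: precision_of(s, X_tr, y_tr, X_te, y_te, SVC),
    #                  reverse=True)[:100]   # the "top 100 sets" examined below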
B. Experimental Results

In this section, we introduce the features extracted by using the proposed method. In the experiment, we focus on the top 100 sets out of all 20875 sets, i.e., those with the highest classification precision. The experimental results revealed that the participants focused on geographical features rather than image features. In particular, geographical features related to the appearance of the objects were emphasized for each purpose. These features seem to be important for determining the usability of maps; we believe that users first consider what objects are described in maps rather than how they are described. We also extracted sets containing both geographical and image features. The features extracted on the basis of each purpose for the maps of the two considered areas are as follows:

• Object confirming on the maps of the Kyoto area: Many sets contained the number of landmark objects (47/100) or the number of objects (47/100). Maps containing many objects were often classified by the participants as usable maps; the number of path objects (32/100) was probably extracted for the same reason. Besides these features, the y-coordinate of a certain object (41/100) was extracted. Further, many sets contained both a geographical feature and an image feature (51/100).

• Object confirming on the maps of the San Francisco area: All of the sets contained the number of landmark objects (100/100), as in the case of object confirming on the maps of the Kyoto area. Further, many sets contained the number of pixels in a column (41/100). This feature was extracted for the same reason as the number of landmark objects, since there was a good correlation between the two features. In addition, most sets contained both a geographical feature and an image feature (79/100).


Figure 4. Usable maps for each purpose (left: object confirming, right: path finding)

Figure 5. Results of features extraction for object confirming (bar chart; horizontal axis: feature numbers 1-50, where 1-29 are geographical features and 30-50 are image features; vertical axis: 0-100; series: Kyoto and San Francisco)

Figure 6. Results of features extraction for path finding (bar chart with the same axes and series as Figure 5)

• Path finding on the maps of the Kyoto area: A large number of sets contained the ratio of node objects (72/100). This feature allows users to know the available modes of transport. In addition, many sets contained the ratio of landmark objects (37/100). We estimate that maps containing only a few landmark objects were often classified by the participants as usable maps, because users probably need objects that tell them how to get to the destination rather than general objects like landmarks. Besides these features, the coordinate of the northern edge (32/100) was extracted. Moreover, some sets contained both a geographical feature and an image feature (40/100).

• Path finding on the maps of the San Francisco area: Most sets contained the ratio of path objects (94/100). This feature provides the user with route information to the destination. The number of objects (39/100) was also extracted. Moreover, many sets contained both a geographical feature and an image feature (63/100).

Figure 4 shows examples of maps classified as usable for each purpose. In the left map, many landmark objects are described, and there are also some path objects; hence, the value of the number of objects is very high. This map was classified as usable for the purpose of object confirming; in other words, it was possible to confirm the positional relations between objects by using this map. In the right map, many path objects and node objects are described in addition to a few landmark objects.


TABLE IV. THE FEATURES CORRESPONDING TO EACH PURPOSE

Each row gives, for one feature, the number of sets among the top 100 that contained it, first for object confirming (Kyoto, SF) and then for path finding (Kyoto, SF).

     feature                                              object confirming: Kyoto, SF    path finding: Kyoto, SF

1 the coordinate of northern edge 2 3 32 0
2 the coordinate of southern edge 1 5 0 0
3 the coordinate of eastern edge 1 2 8 0
4 the coordinate of western edge 2 3 3 0
5 the area in real world 2 2 8 0
6 scale ratio 2 3 2 1
7 the number of objects 47 2 4 39
8 the number of landmark objects 47 100 0 13
9 the ratio of landmark objects 0 1 37 12
10 the number of path objects 32 10 6 1
11 the ratio of path objects 0 0 2 94
12 the number of edge objects 5 2 27 2
13 the ratio of edge objects 1 2 0 0
14 the number of district objects 2 1 2 0
15 the ratio of district objects 5 7 2 7
16 the number of node objects 8 0 7 5
17 the ratio of node objects 2 3 72 7
18 dispersion of x-coordinates 5 2 0 2
19 dispersion of y-coordinates 1 2 0 3
20 dispersion in an image 3 2 0 1
21 dispersion of latitudes 5 2 1 0
22 dispersion of longitudes 3 4 4 2
23 dispersion in real world 5 3 5 0
24 x-coordinate of a certain object 2 3 2 1
25 y-coordinate of a certain object 41 4 1 7
26 the distance from center to a certain object 2 3 2 4
27 cardinal direction 3 2 0 1
28 the information for route guidance 3 1 0 14
29 the other information 4 3 2 2
30 the number of pixels in a row 4 5 0 6
31 the number of pixels in a column 7 41 1 0
32 the number of pixels 3 3 1 2
33 the overall mean of red color 4 2 1 3
34 the mean of red color around a certain object 2 2 1 1
35 the difference of red color 5 1 2 0
36 the overall mean of green color 1 2 1 4
37 the mean of green color around a certain object 2 4 2 6
38 the difference of green color 4 2 0 2
39 the overall mean of blue color 1 1 0 2
40 the mean of blue color around a certain object 3 1 11 12
41 the difference of blue color 2 2 0 13
42 the overall mean of hue 4 14 1 4
43 the mean of hue around a certain object 0 0 3 1
44 the difference of hue 1 1 0 6
45 the overall mean of saturation 1 2 8 4
46 the mean of saturation around a certain object 0 2 4 2
47 the difference of saturation color 1 3 5 3
48 the overall mean of brightness 4 2 1 0
49 the mean of brightness around a certain object 4 7 0 6
50 the difference of brightness color 2 4 0 0

This map was classified as usable for the purpose of path finding. The node objects are described along the subway lines denoted by the colored lines. By using this map, a user could obtain route information if he wanted to travel by train; route information can also be obtained from the path objects described in the map.

Figure 5 shows the results of the effective features extraction for the purpose of object confirming. The horizontal axis shows the serial number assigned to each feature as listed in Table III. The vertical axis shows the number of sets containing each feature among the top 100 sets. The number of landmark objects was particularly effective for both of the considered areas. In addition, the number of objects and the number of path objects were emphasized in the maps of the Kyoto area. On the other hand, the number of pixels in a column was emphasized on the maps of the San Francisco area. This implies that maps containing many objects were often classified by the participants as usable maps. We infer that the representation of many objects on a map helps users understand the positional relations between the objects. Hence, we believe that the number of landmark objects was important for the purpose of object confirming.


the area considered. Input form of objects name Map display


Figure 6 shows the results of the effective features ex-
traction for the purpose of path finding. Different features
were emphasized for different areas for the purpose of
path finding. The ratio of node objects was the most
important feature in the maps of the Kyoto area. On the
other hand, the ratio of path objects was particularly
effective in the case of the maps of the San Francisco
area. Both these features contained information regarding
modes of transport such as by road or rail. It was
estimated that maps containing a number of paths or
nodes were often classified by the participants as usable
maps because users probably require objects to know how
to reach their destinations. Hence, we believe that the Candidates of maps: select channel of maps
ratio of node objects and the ratio of path objects
were important for the purpose of path finding. The other Figure 7. Interface of map search engine
features depended on the area considered.
Finally, we concluded the effective features for each purpose. Table IV shows the features corresponding to each purpose. In order to select effective features, we considered the features that were contained in more than 30 sets. The features for the purpose of object confirming are as follows:
- the number of landmark objects
- the number of objects
- the number of path objects
- the y-coordinate of a certain object
- the number of pixels in a column
For the purpose of object confirming, objects were used for understanding positional relations between objects. Therefore, maps containing a number of objects were important, and features related to the number of objects were extracted.
On the other hand, the features extracted for the purpose of path finding are as follows:
- the coordinate of the northern edge
- the number of objects
- the ratio of landmark objects
- the ratio of path objects
- the ratio of node objects
For the purpose of path finding, on the other hand, objects were used to know how to reach the destination. Therefore, features related to path objects or node objects were extracted for obtaining the route information or information on the available modes of transport.
In addition to these results, an improvement of the features may positively affect the classification of the maps. In the experiment, we classified all the objects on the basis of five elements that make up a city. However, if we classify the objects into categories such as shops, restaurants, and temples, on the basis of the purpose that these objects serve, we may be able to consider a relatively large number of user requirements. On the other hand, various image features related to a certain object were extracted in the experiment. This reflected the importance of the showiness of a certain object. Hence, we can obtain more positive results by unifying some features related to the showiness of a certain object.

VI. CONCLUSIONS

We proposed a map search engine that ranks maps on the basis of user intention. In this search engine, it is important that the retrieved maps match the user's purposes, because the usability of maps depends on the user's purpose. In order to interpret map contents, features are necessary to reflect user requirements. Hence, we defined map features consisting of two types of features: geographical features and image features. Although maps have a wide variety of features, excessive features may have harmful effects on ranking. Hence, we extracted effective features for a map search engine by using an SVM. We considered two categories of purposes as user requirements: object confirming and path finding. In the obtained results, geographical features were particularly emphasized. In particular, the number of landmark objects was effective for object confirming. On the other hand, the ratio of path objects and the ratio of node objects were important for path finding. We concluded that these features depend on each purpose. Furthermore, we extracted some features that depend on each area. In addition, various image features were also extracted. These features will become effective by unifying some of them.

In the future, we intend to develop a search engine for the retrieval of maps by using the features extracted by the proposed method. Figure 7 shows an assumed interface of the map search engine. Our idea is based on relevance feedback. In this system, a user first enters an object name as a query into the input form of object names. Second, the system presents candidate maps that may have the required information. The candidates represent the object names entered by the user. Third, the user can select some usable maps that match their purpose. The system estimates the user's requirements from the selected maps. Finally, the system presents new candidate maps ranked on the basis of the user's map selection. It is possible to retrieve better maps interactively because the search query is improved whenever new maps are selected by the user. We plan to evaluate the results retrieved by using the extracted features for relevance feedback.


In this paper, we considered two areas: Kyoto and San Francisco. However, this may not be enough to analyze features that depend on a certain area. Hence, we plan to conduct experiments considering many more areas.

In addition, we have to analyze map deformations. Kitayama et al. [17] noted that modified maps include incorrect information, but incorrect information should be allowed when a map is modified by good deformations. Thus, there is a trade-off between accuracy and deformation for certain purposes. In future work, the ranking of maps should be improved with consideration of this trade-off.

REFERENCES

[1] Google maps, http://maps.google.com/.
[2] Bing maps, http://www.bing.com/maps/.
[3] K. Kobayashi, R. Lee, and K. Sumiya, "Systematic measurement of human map-reading ability with street-view based navigation systems," in Proc. of 4th International Conference on Ubiquitous Information Management and Communication (ICUIMC 2010), 2010, pp. 286-293.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[5] H. Honda, K. Yamamori, K. Kajita, and J. Hasegawa, "A system for automated generation of deformed maps," in Proc. of the IAPR Workshop on Machine Vision Applications (MVA 1998), 1998, pp. 149-153.
[6] K. Fujii and K. Sugiyama, "Route guide map generation system for mobile communication," Transactions of Information Processing Society of Japan, vol. 41, no. 9, pp. 2394-2403, 2000.
[7] M. Agrawala and C. Stolte, "Rendering effective route maps: Improving usability through generalization," in Proc. of 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001), 2001, pp. 241-249.
[8] T. Osaragi and S. Onozuka, "Map element extraction model for pedestrian route guidance map," in Proc. of 4th IEEE International Conference on Cognitive Informatics (ICCI 2005), 2005, pp. 144-153.
[9] F. Grabler, M. Agrawala, R. W. Sumner, and M. Pauly, "Automatic generation of tourist maps," in Proc. of 35th International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2008), 2008, pp. 1-11.
[10] M. Michelson, A. Goel, and C. A. Knoblock, "Identifying maps on the world wide web," in Proc. of 5th International Conference on Geographic Information Science, 2008, pp. 249-260.
[11] Y. Y. Chiang and C. A. Knoblock, "Classification of raster maps for automatic feature extraction," in Proc. of 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2009, pp. 138-147.
[12] S. Newsam, D. Leung, O. Caballero, J. Floreza, and J. Pulido, "Cbgir: content-based geographic image retrieval," in Proc. of 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 526-527.
[13] K. Oku, S. Nakajima, J. Miyazaki, S. Uemura, and H. Kato, "A ranking method based on users' contexts for information recommendation," in Proc. of 2nd International Conference on Ubiquitous Information Management and Communication (ICUIMC 2008), 2008, pp. 289-295.
[14] K. Lynch, The Image of the City. The MIT Press, 1960.
[15] K. Oku, S. Nakajima, J. Miyazaki, and S. Uemura, "Context-aware recommendation system based on context-dependent user preference modeling," IPSJ Transactions on Databases, vol. 48, no. 11, pp. 162-176, 2007.
[16] "LIBSVM - a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[17] D. Kitayama, R. Lee, and K. Sumiya, "Deformation analysis based on geographical accuracy and spatial context for modified maps credibility," in Proc. of 44th Hawaii International Conference on System Sciences (HICSS-44), 2011, pp. 1-9.

Junki Matsuo is a student of the Graduate School of Human Science and Environment, University of Hyogo. His research interests include geographic information systems. He is a student member of the Information Processing Society of Japan and the Database Society of Japan.

Daisuke Kitayama is an Assistant Professor at Human Science and Environment, University of Hyogo. His research interests include multimedia databases, data analysis, and integration of heterogeneous media. He received a Ph.D. in human science and environment in 2009 from the University of Hyogo. He is a member of the Information Processing Society of Japan and the Database Society of Japan.

Ryong Lee is an Expert Researcher at the National Institute of Information and Communications Technology (NICT). His research interests include geographic information systems, social network analysis and web information retrieval systems. He is a member of the ACM and the Institute of Electronics, Information and Communication Engineers. He received a Ph.D. and an M.S. from the Graduate School of Social Informatics, Kyoto University in 2003 and worked as a senior researcher at Samsung Advanced Institute of Technology until June 2008.

Kazutoshi Sumiya is a Professor at Human Science and Environment, University of Hyogo. He received the Ph.D. degree in engineering in 1998 from the Kobe University Graduate School of Science and Technology. His research interests include multimedia databases and data broadcasting. He is a member of the IEEE Computer Society, the ACM, the Institute of Image Information and Television Engineers, the Information Processing Society of Japan, the Database Society of Japan and the Institute of Electronics, Information and Communication Engineers.


Achieving Dynamic and Distributed Session Management with Chord for Software as a Service Cloud

Zeeshan Pervez, Asad Masood Khattak, Sungyoung Lee, Young-Koo Lee
Ubiquitous Computing Lab, Kyung Hee University, Yongin, Korea
Email: {zeeshan, asad.masood, sylee}@oslab.khu.ac.kr, yklee@khu.ac.kr

Abstract: Cloud computing, which started as a buzzword, has been rapidly embraced by enterprises and preached by technology evangelists. The availability of high-bandwidth internet at the end-user level, and the adoption of virtualization for efficient resource utilization by data-center management, have given birth to this new computing paradigm. It promises colossal on-demand processing and storage capacity along with a scalable service delivery model. Software solution providers are applying cloud computing to reduce service provisioning cost by providing their business functionality as a service. However, this requires modification in how existing services are provisioned. Existing session management policies require dedicated computing resources to process sessions; this deviates from the concept of Pay-As-You-Use. To conform to the cloud computing architecture, there is a need to decouple session management from the provisioned services. Driven by the need for on-demand service provisioning, in this paper we present a decentralized session management framework inspired by a P2P routing protocol. We call the proposed framework Chord based Session Management Framework for Software as a Service Cloud (CSMC). Applying CSMC eliminates the need to separately deploy computing resources for session management; in fact, CSMC uses the existing least utilized resources within the Cloud Area Network (CAN). CSMC, tested on three different cloud configurations, highlights the fact that it can be effectively deployed in a cloud to achieve seamless service scalability. Additionally, we have tested CSMC on different web-servers to highlight its efficacy of session management on varied cloud infrastructure.

Index Terms: Session Management, Algorithms, Management, Measurement, Performance

This paper is based on "CSMC: Chord based Session Management Framework for Software as a Service Cloud" by Z. Pervez, A. M. Khattak, S. Y. Lee, and Y. K. Lee, which appeared in the Proceedings of ICUIMC '11, February 21-23, 2011, Seoul, Korea. (c) 2011 ACM. This research was fully supported by Microsoft Research Asia. Professor Sungyoung Lee is the corresponding author of this paper.

I. INTRODUCTION

Software delivery models have evolved over time: from stand-alone applications to client-server architecture, and from distributed to service-oriented architecture (SOA) [1]. All of these transformations were intended to make business process execution effectual and to provide ease of use. New software delivery models emerge because either the earlier delivery models no longer support the business needs, or technological advancement has broken some barriers which were considered to be inevitable in previous ones. The exponential increase in the processing power of enterprise servers, the adoption of virtualization, and the availability of high bandwidth to the end user have given birth to a new type of computing paradigm known as cloud computing [2]. It encompasses Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

Among the cloud computing stack, SaaS is a software delivery model which provides access to business functionality remotely as a service [3]. Leading companies in the Information Technology industry are gradually moving their applications and related data over the internet and delivering them through SaaS [4]. Google has used the SaaS platform to offer web applications for communication and collaboration [5], gradually replacing resource-exhaustive desktop applications. Similarly, Microsoft is offering its development and database services through SaaS, called Microsoft Azure [6]. SaaS is preached by companies like SalesForce, 3Tera, Microsoft, Zoho and Amazon, as a result of which business-specific services can be consumed in a ubiquitous environment.

One of the distinguishing features of cloud computing is the adoption of virtualization [2], which helps service providers to provision their services on an on-demand basis. These services encompass software (business applications) and data storage services, or even hardware and network resources as a service [7]. The concept of pay-as-you-use is principally backed by virtualization. Although on-demand service (service scaling) is just a matter of a simple click, as advertised by most of the cloud hosting providers [8], in fact there are many issues related to this simple click event; session management, virtual machine deployment, and load balancing are a few of them. Services are scaled (up or down) to comply with the service level agreement (SLA) signed between the service provider and the service consumer, or to reclaim the resources when they are not required (a smaller number of concurrent users).

SaaS can be classified into four levels [9]; at the highest level (Level-IV), services are multi-tenant and configurable. Multi-tenant services are developed keeping in view the heterogeneity of service consumers. Diverse consumers can subscribe to the same instance of a service, yet they will experience a bespoke response according to their business requirements. Level-IV services are the most lucrative for any service provider, since they only need to spend


once on the development process, and later on these services can be configured according to customer requirements. However, provisioning these types of services demands some tailored management procedures in terms of session handling and load balancing.

Session management procedures which are currently deployed by hosting providers work well in the web (client-server) architecture, where there is no concept of pay-as-you-use and resource utilization is not considered the utmost priority. Existing session management algorithms used by most of the web-servers are developed under the hypothesis that resources (compute and storage servers) will be available throughout their lifecycle. Although web-servers provide a disaster recovery mechanism by replicating session information on multiple servers, they are nevertheless not adaptive to the truly dynamic nature of the cloud, in which a resource can be added or removed with a single command on the cloud management console.

Deploying a service in the cloud using these conventional session management procedures increases the service provisioning cost, as they demand dedicated resources to process and store session information. Apart from that, these session management procedures also hinder the development of Level-IV services, as sessions are bound to a particular instance of a web-server, restricting consumers to one instance of a server. In this paper we present a decentralized session management algorithm which is not bound to any particular web-server. Decoupling of session management helps in provisioning on-demand services and reclaiming cloud compute resources when they are not required, reducing the service provisioning cost without affecting existing active sessions. We use a P2P routing protocol (i.e., Chord) to distribute session values among the cloud compute resources. With the P2P routing protocol, resources are efficiently utilized without the need for a dedicated session management server.

The rest of the paper is organized as follows: in Section II we discuss the different session management methodologies provided by the existing web-servers. Section III summarizes some of the systems in which Chord has been successfully utilized to develop distributed applications. Section IV presents our proposed distributed session management framework. Section V describes the session management procedure using Chord. Section VI outlines the test bed used for the experiments, along with the implementation strategy. In Section VII we present our results in three different configurations. Section VIII presents the future work, and finally in Section IX we conclude our work.

II. RELATED WORK

Web applications and services are hosted on web-servers to deliver their contents and functionality to the end user. There is an exhaustive list of web-servers used by the industry to provision web applications and services. Three well-known and commonly used web-servers are Internet Information Services [7], Apache Tomcat [10], and GlassFish [11]. These web-servers are designed to achieve high throughput and to cater for the flash crowd problem. Besides this, with the emergence of the Web 2.0 interactive application concept, these web-servers are configured to execute numerous HTTP Post requests generated by an individual client application. Various strategies have been adopted by these web-servers to provide a desktop-application-like experience in web applications, which has led them to handle session management in various ways.

Mainly, sessions are handled by applying three distinctive methodologies, subject to application requirements. These requirements include the number of concurrent users, the session validity period, and the inter-arrival time of requests. Apart from that, the usage of sessions in an application also influences the decision of the application architecture in selecting the suitable session management methodology. Below we discuss three different session management procedures used in most of the web-servers.

Figure 1: Web-Server Session Management Methodologies ((a) main memory based, (b) session repository based, (c) dedicated machine based, with a load balancer and worker nodes 1..n running web-servers)

A. Main Memory Based Session Management (MMB)

Main Memory Based Session Management (MMB) is frequently used and is enabled by default in most of the web-servers [12], [10], [11]; it persists session information in the worker process of the web-server, as shown in Figure 1(a). This strategy is best suited for an application which has a limited number of concurrent users. Whenever session information is required, the web-server can extract it from the worker process. In this methodology the session state depends on the lifetime of the application; if the application is restarted, all of the active sessions are lost. This methodology works well for applications where the session is not intensively used in business logic to persist data.

B. Repository Based Session Management (RB)

Repository Based Session Management (RB) is applied to medium-size applications [12], [10], [11]. It persists sessions in a dedicated database or file, called the


session repository, shown in Figure 1(b). This approach is used in applications where sessions are rigorously used to store business objects when navigating between web pages. With this approach the session validity period can be increased to a much longer duration as compared to MMB. In addition to that, it also facilitates the application developer in persisting an entire business object in the session without compromising application response time. It also trims down the usage of main memory and takes advantage of database algorithms for searching and indexing a huge repository of active sessions.

C. Dedicated Machine Based Session Management (DMB)

The third methodology is applied for massively large applications. It is best suited for applications [12], [10], [11] where sessions are persisted in multiple locations in order to avoid any failure and to achieve load balancing in case there are too many hits on a web-server. Figure 1(c) shows the DMB topology, consisting of one Load Balancer (master node) and multiple worker nodes (web-servers). This approach is mainly adopted by enterprise applications where sessions are created for a much longer duration of time and must be persisted to improve the user experience and to reduce the dependency on other components (in case business objects are not subject to frequent changes). This approach requires dedicated resources for session management. Usually a scheduling algorithm is deployed on the master node that routes the incoming requests to an appropriate web-server. Apart from that, this technique of session management requires a replica of the session repository on each web-server.

IIS, Tomcat, and GlassFish are shipped with the three session management approaches described earlier, with a few variations. IIS uses Microsoft SQL Server for Repository Based session management, whereas GlassFish uses the local file system to persist session information instead of a dedicated database. Tomcat uses a FileStore as an alternative to main memory; for every new session a separate file is created in the FileStore that persists the session information. However, for the DMB approach all of these web-servers employ the same strategy: at the master node a load balancer is deployed and the actual sessions are persisted on multiple worker nodes.

Existing web-servers provide session management procedures explicitly engineered keeping in view the web architecture. Cloud computing preaches on-demand virtualized services which can scale according to their utilization requirements. There is a need for a session management procedure that can scale with the services. Making use of dedicated compute nodes for session management would restrict the service provider from provisioning services on demand. P2P algorithms are well known for their scalability and distributed nature. A lot of literature has been published on P2P routing protocols. Chord is a P2P routing protocol which has been successfully used in various applications to achieve scalability.

III. CHORD IN DISTRIBUTED SYSTEMS

Chord [13] is a lookup protocol dedicated to internet applications that need to discover any type of resource maintained by the users that form an underlying network [14]. It provides an elementary service: for a given key, Chord returns a node identifier that is responsible for hosting or locating the resource. Chord has been deployed in several applications: CFS (Cooperative File System) [15], which is an internet-scale distributed file system, and ConChord [16], which uses CFS to provide a distributed framework for the delivery of SDSI (Simple Distributed Security Infrastructure) security certificates.

Some applications employ Chord in a much different way as compared to file sharing applications. Snapshot [17] is a distributed network management algorithm developed on the basis of Chord. This management scheme helps telecommunication carriers to gather information about the current performance capabilities of their network. Besides this, it also assists them in monitoring the entire network or a subset of it. Each subset of the network creates a snapshot of the underlying network which is then used to identify the points where countermeasures are required.

Network heterogeneity is another problem which can affect the response time of a lookup query in a Chord network. Not all participating nodes possess the same processing power and network bandwidth. [18] is another file sharing variant of Chord which addresses network heterogeneity. To overcome this problem the authors proposed an improved Chord model, based on Topic-Cluster and Hierarchic Layer (HTC-Chord). The proposed algorithm divides the network according to the interest (Topic) and processing capabilities of the available nodes. Through this scheme, a lookup request is restricted to a subset of nodes in which the nodes have the same interests. As a result, the response time of a request is reduced since it is only routed to the nodes having similar interests and possessing appropriate processing power.

[19] proposed a key lookup strategy based on Power Law. They introduced the concept of a Super Node, which possesses huge processing power and high bandwidth availability. A Super Node works as an anchor node: instead of diving deep into the Chord network, requests are entertained by the Super Nodes, avoiding nodes which have less processing capability.

IV. SYSTEM ARCHITECTURE

CSMC is a Chord based session management framework for the Software as a Service cloud (SaaS), which provides distributed session management enabling session decoupling. CSMC enables service providers to achieve seamless service scalability without interfering with the processing of existing active sessions. The component stack of CSMC consists of six managerial components, shown in Figure 2.


Figure 2: CSMC Component Stack (a cloud node web server, e.g. node id 27, runs the Cloud Gateway for request propagation in the cloud, the Resource Manager for selecting the least utilized resource, the Node Manager for providing usability statistics, the Chord Manager for session lookup and key management, the Session Manager for session management and validation/creation, and the Service Manager for service provisioning)

A. Cloud Gateway (CGW)

The topmost layer of CSMC is CGW; it works as the entry point into SaaS. Every service provisioned by the cloud is accessible through CGW. Applications consuming cloud hosted services have no idea about the underlying component stack; for them CGW is the service provider. This high level of abstraction is very useful during service scaling. As client applications only interact with CGW, there is no need to change the service binding when multiple instances of a service are deployed. Internal components of CSMC will automatically route the client's request to the most appropriate instance. The underlying components of the CSMC stack will ensure that services are provisioned according to the signed SLA.

B. Resource Manager

Effective resource utilization is one of the key selling points of cloud computing. The Resource Manager is the component which has the global view of resource utilization in the cloud. In order to avoid bottlenecks, the Resource Manager constantly routes the requests to the least utilized resources. It works like a bridge between CGW and the actual computing resources in the cloud. Consequently, CGW does not need to handle the request forwarding task; instead this has been delegated to the Resource Manager. This enables CGW to manage incoming requests while the Resource Manager governs the selection of the most appropriate service instance.

C. Node Manager

The Node Manager assists the Resource Manager in selecting the appropriate service instance depending on the client's SLA. It periodically updates information about resource utilization to the Resource Manager. It monitors the worker thread of the web-server to analyze the response time of the node. Each node in a CSMC enabled cloud is deployed to provide two primary functions; the first is service provisioning and the second is Chord based session management. Service provisioning is the core function of every node. In order to avoid the situation where too many requests are routed to the same node, each individual Node Manager constantly examines the processing capacity of the node and updates the Resource Manager.

D. Chord Manager

The Chord Manager is the core component in CSMC; all Chord related functions are handled by the Chord Manager. Functions like key value lookup (session identifier), finger table scan, and request forwarding are handled by it. In short, the Chord Manager is responsible for distributed session management. In CSMC every node is responsible for storing a fraction of the active sessions. A unique chord identifier is assigned to each compute node by the Resource Manager. Sessions are allocated to each compute node according to its chord identifier. In interactive applications, session management is one of the prime concerns of application developers, as well as of the service providers. As applications are becoming more and more interactive, sessions are intensively used by application developers to persist business objects during HTTP Post requests [20] or in the case of partial call back operations (AJAX) [21]. The Chord Manager, together with the Session Manager, helps hosted services to validate session authenticity and provide the desired session information.

E. Session Manager

For every legitimate user a new session is created by the Session Manager if it does not exist or if its validity period has expired. Since HTTP is a stateless protocol, session management is the most efficient mechanism to persist business objects while navigating between web pages. Apart from that, the session is also used to validate the user. Every session is valid for a particular period of time, after which it is discarded. The importance of session management is escalated if the client application is an


interactive application, which needs a quicker response as compared to conventional web applications. Besides this, as web applications provide functionality with a desktop-like experience, more and more business objects are persisted in session variables, which demands more memory space and reduced lookup time.

F. Service Manager

In the SaaS architecture a single compute node provides multiple services through virtualization. In order to identify their usage pattern, the Service Manager is added. The Service Manager keeps track of all of the hosted services on a single compute node. The Service Manager assists the SLA manager in deciding which service should be scaled for effective resource utilization. It also assists the Node Manager in analyzing the worker process of the web-server, which helps in reducing the session lookup time.

Collectively, all six managerial components of CSMC facilitate achieving distributed session management in the cloud, driven by the need for cost-effective resource utilization. With CSMC, there is no need for dedicated session management components in the cloud, which would increase the service provisioning cost and could become a bottleneck in the case of a flash crowd.

V. SESSION MANAGEMENT IN CSMC

In cloud computing, services are scaled according to the number of concurrent users/requests. [9] describes four levels of service provisioning models; at the highest level (Level-IV), services are scalable, configurable and possess the multi-tenant property. To achieve true multi-tenancy, there is a need to decouple session management from a particular web-server. In the cloud, services are provisioned on virtualized resources (Virtual Machines), and these resources can be reclaimed if not required, or more virtualized resources can be added if necessary. In this context the availability of a web-server depends on the number of concurrent users. To achieve seamless service scaling (up or down) there is a need for session decoupling, so that the execution of concurrent sessions is not disrupted and new sessions can be created seamlessly.

With CSMC we have accomplished true session decoupling from the hosted services. CSMC enables service providers to scale their services without making any changes in the underlying configuration. Figure 3 shows the CSMC topology in a Cloud Area Network (CAN).

Figure 3: CSMC in Cloud Area Network (service consumption requests are routed by the Cloud Gateway (CGW) through the Resource Manager, which selects the least utilized resource; cloud nodes forward requests to their successors and provide statistics about node resource utilization; each compute node runs the CSMC component stack of Node Manager, Chord Manager, Session Manager and Service Manager)

CSMC works in a collaborative manner. Every component of CSMC provides assistance to the other components. As a result a single point of failure is avoided, and this type of disseminated strategy is also best suited for the flash crowd problem that demands additional compute nodes. CGW is the point of interaction for every service consumer. The idea is similar to that of Service Oriented Architecture (SOA): services are exposed without providing the internal service composition logic. Every service consumer binds its client application with CGW to consume the hosted services. A received request is then delegated to the Resource Manager, which routes the request to the least utilized service instance according to the SLA. At a very abstract level the Resource Manager performs the request routing, but internally it segregates the request according to the SLA and selects the resource (compute node / service instance) which is most suitable for conforming to the SLA. This type of resource selection requires resource utilization information, which can provide information about the current processing capabilities of a compute node. In CSMC this utilization information is provided by the Node Manager, which periodically intimates the Resource Manager about its processing capabilities.

On receiving a service usage request, an individual service needs session information in the case of an HTTP Post request. In a CSMC enabled cloud, sessions are not bound to any particular instance of a service; in fact sessions are bound to compute nodes according to the session identifier. To locate a session within the CAN, the session identifier is utilized, which indicates the node responsible for maintaining the session information. Session information is retrieved from the CAN through Chord in logarithmic time, on average (1/2) log2 N, where N is the number of compute nodes in the CAN [13].

Figure 4 shows the Chord topology for a CAN of 8 nodes having a maximum capacity of 16 compute nodes. It is clear that the current cloud capacity can be doubled without requiring any configuration alteration in the existing topology. Black dots in the Chord ring show the absence of compute nodes, whilst white circles show the actual compute nodes to which a request can be routed. On each compute node the CSMC component stack is deployed, which helps in maintaining the Chord ring and intimating the Resource Manager about the node's processing capability. This information helps in automated request routing and service scaling. Each compute node is connected to its successor in the Chord ring and additionally contains a finger table to route session lookup queries to an appropriate node.
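A minimal sketch of the lookup behaviour just described is given below, under the assumption of standard Chord routing with populated finger tables; the class and method names are illustrative only and are not the authors' API. Each hop forwards the query to the closest preceding finger, which is what yields the average path length of about (1/2) log2 N quoted above.

```java
import java.util.*;

/**
 * Minimal sketch of a Chord-style session lookup over a modulo 2^m
 * identifier space, assuming the finger tables are already populated.
 */
public class SessionLookup {

    static class Node {
        final int id;                                   // chord identifier
        Node successor;                                 // immediate successor
        final List<Node> fingers = new ArrayList<>();   // finger table entries

        Node(int id) { this.id = id; }

        /** True if key lies in the half-open interval (from, to] on the ring. */
        static boolean inInterval(int key, int from, int to) {
            if (from < to) return key > from && key <= to;
            return key > from || key <= to;             // interval wraps around 0
        }

        /** Returns the node responsible for the given session key. */
        Node findSuccessor(int key) {
            if (inInterval(key, id, successor.id)) return successor;
            Node next = closestPrecedingFinger(key);
            if (next == this) return successor;         // guard against a stale table
            return next.findSuccessor(key);             // each hop roughly halves the distance
        }

        /** Scans the finger table for the closest node preceding the key. */
        Node closestPrecedingFinger(int key) {
            for (int i = fingers.size() - 1; i >= 0; i--) {
                Node f = fingers.get(i);
                if (inInterval(f.id, id, key) && f.id != key) return f;
            }
            return this;
        }
    }
}
```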


Figure 4: CSMC Session Lookup Request Propagation (a service consumption request enters through the Cloud Gateway and Resource Manager, which validate session id 1837241306 and select the least utilized resource in an example CAN with chord space 2^4; the legend distinguishes request routing by the Cloud Gateway (CGW) to the least busy node, request forwarding by cloud nodes to their successors, and request forwarding by cloud nodes to the appropriate location; the sample finger tables map session identifiers 4, 5, 7, 11 to nodes 4, 7, 7, 11 and identifiers 8, 9, 10, 15 to nodes 10, 10, 10, 15)

VI. TEST BED IMPLEMENTATION

We tested CSMC for different numbers of compute nodes: 5, 9 and 20, represented by the modulo 2^3, 2^4 and 2^5 chord spaces respectively. The test bed consists of a Cloud Gateway, a Resource Manager, and multiple compute nodes. The Cloud Gateway and Resource Manager are deployed on Windows 7 Enterprise running on an Intel Quad Core with 4 GB RAM and a 360 GB hard drive. Compute nodes are virtualized images of Windows XP Service Pack 3.0 having 2.0 GHz of processing power and 1.5 GB of main memory. On all compute nodes IIS 5.10, Apache Tomcat 6.0 and GlassFish Server 3.1.1 are deployed as web-servers, along with .Net Framework 4.0 and JRE 1.6 as runtime environments. On each compute node the same instance of a web service is deployed to mimic the business logic provisioned by multiple compute nodes. We used OpenSTA [22] as a load generator for the hosted services. To test CSMC for IIS, the component stack is developed in .Net Framework 4.0, whereas for Apache Tomcat and GlassFish Server it is developed by using JDK 1.6.

The CSMC component stack developed in .Net Framework 4.0 is deployed on each compute node as a WCF web service. One of the benefits we get from WCF is dynamic service binding. Instead of having a predefined binding between the compute nodes, successor and finger references, only a generic binding is defined, whose end points can be dynamically configured at run time accordingly. For the Java based component stack, by contrast, dynamic binding is not possible and the component stack has to load predefined bindings between the successor and finger references. However, both implementations of the component stack support a direct response to the requester node once the session information is identified in the CAN.

To notify the Resource Manager about resource utilization, the .Net based component stack uses the PerformanceCounter class provided by the System.Diagnostics namespace, whereas the Java component stack uses the java.lang.management package. Information about resource utilization is sent back to the Resource Manager periodically every 15 seconds (but this is configurable according to the requirement).
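As a hedged illustration of the Java side of this reporting loop, the sketch below uses only the java.lang.management beans named above and a 15-second schedule; the reporting call itself is a hypothetical placeholder, since the paper does not describe the actual transport to the Resource Manager.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of the periodic utilization report a Node Manager could
 * send to the Resource Manager. The 15-second interval follows the paper;
 * reportToResourceManager is a placeholder for the real call.
 */
public class NodeUtilizationReporter {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        final OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                double load = os.getSystemLoadAverage();               // -1 if unsupported
                long usedHeap = memory.getHeapMemoryUsage().getUsed(); // bytes in use
                reportToResourceManager(load, usedHeap);
            }
        }, 0, 15, TimeUnit.SECONDS);
    }

    /** Placeholder for the call that intimates the Resource Manager. */
    void reportToResourceManager(double load, long usedHeapBytes) {
        System.out.printf("load=%.2f usedHeap=%d bytes%n", load, usedHeapBytes);
    }
}
```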
VI. T EST B ED I MPLEMENTATION Session values are stored on individual nodes using
SQL Server 2008 but is not limited to database. CSMC
We tested CSMC for different number of compute
is configurable, depending on the requirement, file based
nodes 5, 9 and 20 represented by the modulo of 23 , 24
session management can be used simply by configuring
and 25 chord space respectively. The test bed consists of
CSMC component stack to a file based session manage-
a Cloud Gateway and a Resource Manager, and multiple
ment. Different type of session management policies are
compute nodes. Cloud Gateway and Resource Manager
handled by Session Manager, providing the abstraction
are deployed on Windows 7 Enterprise running on Intel
for querying the underlying session management policy.
Quad Core with 4 GB RAM and 360 GB hard drive.
Depending on the management policy Session Manager
Compute node are virtualized images of Windows XP
will create new sessions and retrieve desire session values
Service Pack 3.0 having 2.0 GHz of processing power
from the configured medium. CSMC does not support
and 1.5 GB of main memory. On all compute nodes
main memory based session management policy, because
IIS 5.10, Apache Tomcat 6.0, GlassFish Server 3.1.1 are
CSMC is implemented as a web service; storing the entire
deployed as web-servers along with .Net Framework 4.0
session repository in main memory is a not a feasible
and jre 1.6 as runtime environments. On each compute
solution.
node same instance of a web service is deployed to
mimic the business logic provisioned by multiple compute
nodes. We used OpenSTA [22] as a load generator for the VII. E XPERIMENTS AND R ESULTS
hosted services. To test CSMC for IIS, component stack We examined CSMC behavior on three different config-
is developed in .Net Framework 4.0 whereas for Apache urations. At the very basic level we evaluated CSMC for
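The node-join re-allocation described earlier in this section can be pictured with the following sketch, assuming the host assignment used in Tables I and III (a node stores the identifiers between its predecessor and itself); it is an illustration only, not the authors' re-allocation code.

```java
import java.util.*;

/**
 * Sketch of session re-allocation when a compute node joins the CAN:
 * the sessions that now hash to the new node (previously held by its
 * successor) must migrate to it. All names are hypothetical.
 */
public class SessionReallocation {

    /** Node responsible for an identifier: first node at or after it, wrapping around. */
    static int hostOf(int id, TreeSet<Integer> nodes) {
        Integer h = nodes.ceiling(id);
        return (h != null) ? h : nodes.first();
    }

    /** Session identifiers that must migrate to the newly joined node. */
    static List<Integer> sessionsToMove(int joining, TreeSet<Integer> nodes,
                                        Collection<Integer> sessionIds) {
        TreeSet<Integer> afterJoin = new TreeSet<>(nodes);
        afterJoin.add(joining);
        List<Integer> toMove = new ArrayList<>();
        for (int sid : sessionIds) {
            if (hostOf(sid, afterJoin) == joining) {
                toMove.add(sid);
            }
        }
        return toMove;
    }

    public static void main(String[] args) {
        // Example in the modulo 2^3 space of Table I: nodes {1, 2, 4, 5, 7}.
        TreeSet<Integer> nodes = new TreeSet<>(Arrays.asList(1, 2, 4, 5, 7));
        // If node 3 joins, identifier 3 (previously hosted by node 4) migrates to it.
        System.out.println(sessionsToMove(3, nodes, Arrays.asList(0, 3, 6)));
    }
}
```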


VII. EXPERIMENTS AND RESULTS

We examined CSMC behavior on three different configurations. At the very basic level we evaluated CSMC for the modulo 2^m (where m = 3) chord space; Table I shows the compute nodes considered in this basic configuration. Each node's chord identifier is listed along with its finger table entries. Additionally, we have added a Hosted ID column, indicating the node responsible for storing the session identifiers in the modulo 2^3 chord space, because Node-0, Node-3 and Node-6 are not available in the chord space.

For the medium sized cloud we considered the modulo 2^m (where m = 4) chord space (see Table II). In total 9 compute nodes are used, on which services are deployed along with the CSMC component stack. Sessions are distributed among the nodes in the modulo 2^4 chord space; if the desired compute node is missing, then its successor is held responsible for storing and processing the session repository.

We have tested CSMC in the modulo 2^m (where m = 5) chord space with a maximum capacity of 32 compute nodes. Table III shows the compute nodes, along with their finger tables and hosted IDs.

We have tested CSMC on the three test bed configurations (modulo 2^m where m = 3, 4, and 5 respectively) explained earlier in this section. Session decoupling has been successfully tested in all of these configurations. The purpose of these experiments is to emphasize the fact that CSMC outperforms the conventional session management architecture irrespective of the size of the cloud.

In total one million session values are distributed among the compute nodes (modulo 2^m where m = 3, 4, and 5 respectively). A session object consists of a session identifier (8 bytes) and a user business object. The user business object consists of date of birth, gender, security credentials and a time stamp of the user's last interaction with the system (3, 1, 8 and 3 bytes respectively). In total, the session object for a particular user consists of 23 bytes. On each compute node load is generated by periodically issuing requests through OpenSTA.
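The 23-byte session object described above can be pictured with the following sketch; the field encodings are assumptions made for illustration, as the paper only specifies the field sizes.

```java
import java.nio.ByteBuffer;

/**
 * Illustrative layout of the 23-byte session object: an 8-byte session
 * identifier plus a user business object of 3 + 1 + 8 + 3 bytes.
 */
public class SessionObject {
    public static final int SIZE_BYTES = 8 + 3 + 1 + 8 + 3;   // = 23

    long sessionId;                         // 8 bytes
    byte[] dateOfBirth = new byte[3];       // 3 bytes
    byte gender;                            // 1 byte
    long securityCredentials;               // 8 bytes (encoding assumed)
    byte[] lastInteraction = new byte[3];   // 3 bytes

    byte[] serialize() {
        ByteBuffer buf = ByteBuffer.allocate(SIZE_BYTES);
        buf.putLong(sessionId);
        buf.put(dateOfBirth);
        buf.put(gender);
        buf.putLong(securityCredentials);
        buf.put(lastInteraction);
        return buf.array();
    }
}
```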
In Figure 5a the best and worst case response times for session lookup are shown. In the case of Chord, the best case for a key lookup is when the lookup request is routed directly to the compute node that is responsible for persisting the values, without involving any intermediate request-forwarding compute node. The worst case in Chord is when the lookup request is routed to a compute node that is multiple hops away from the actual compute node. These intermediate nodes fractionally increase the session lookup time as they try to forward the request to the node closest to the required compute node.

Figure 5b shows the average response time for the three configurations. In the case of the first configuration, the average response time is greater than that of the others, as the processing load on each service is much higher than in the second and third configurations. The same relation holds between the second and third configurations. However, in all of these configurations we have achieved seamless service scaling without the need to replicate the session pool for every new instance of a deployed service.

In order to demonstrate the practicality of CSMC we further tested it on Apache Tomcat 6.0 and GlassFish Server 3.1.1. Figures 6a and 6b show the average session retrieval time for Tomcat and GlassFish respectively. Standard Java web services were deployed on these web-servers. In contrast to the WCF .Net service, a standard Java web service does not support dynamic binding. Due to the lack of dynamic binding, both of these web-servers take more time to generate a service response as compared to IIS 5.10. The core purpose of CSMC is to decouple active sessions from service instances; through these results we have shown that session decoupling can be achieved regardless of the underlying web-server.

With CSMC, active sessions can be migrated within the cloud when services are scaled according to computational load and the number of active users. By testing CSMC on three different web-servers we have shown that our proposed system is not confined to any specific web-server. Our evaluation results highlight the fact that CSMC can be adopted by small as well as large scale cloud infrastructure. Through the experiments we have shown that for every web-server (i.e., IIS, Apache Tomcat and GlassFish Server) the session retrieval time is directly proportional to the number of compute nodes. Although CSMC can be used to achieve session decoupling for a small number of compute nodes, its utility certainly increases with a higher number of compute nodes.

VIII. FUTURE WORK

CSMC can be used within private and public clouds to achieve seamless service migration within the cloud. So far we have evaluated CSMC within a private cloud, for service scaling (up or down) and session migration. However, as cloud services are becoming prevalent, the concept of service migration between clouds is emerging. This concept is very important to cater for the vendor lock-in problem. Migrating services between clouds would be a trivial task if the source and target clouds were using the same web-servers. For our future work we are planning to migrate services from our private cloud to Amazon EC2. Service instances from our private cloud will be deployed on EC2 compute nodes and active sessions will be distributed among EC2 instances using CSMC.

IX. CONCLUSION

Through CSMC, we have achieved session decoupling, enabling service providers to scale (up or down) services if required without the need to replicate existing active sessions. Besides this, whenever a new compute node is added to the cloud, CSMC automatically distributes the sessions among the compute nodes. CSMC eliminates the need for a dedicated session state server, which would increase the service provisioning cost. The added advantage we get by applying Chord is self-maintainability. Whenever a new compute node is added or removed, sessions are automatically redistributed among the available resources.

CSMC, evaluated on a number of different configurations, shows how compute nodes can be virtually deployed or


TABLE I.: Chord space of modulo 2^3 compute nodes

Finger: (n + 2^(k-1)) mod 2^3
Chord Identifier (n)    K=1  K=2  K=3    Host Id
1 2 3 5 0,1
2 3 4 6 2
4 5 6 0 3,4
5 6 7 1 5
7 0 1 3 6,7
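For reference, the sketch below (illustrative only, not the authors' code) reproduces Table I: the k-th finger column is (n + 2^(k-1)) mod 2^3, and the Host Id column maps each identifier to the first present node at or after it on the ring.

```java
import java.util.*;

/**
 * Reproduces the finger and host columns of Table I for the modulo 2^3
 * chord space with compute nodes {1, 2, 4, 5, 7}.
 */
public class ChordSpaceTable {
    static final int M = 3;
    static final int SPACE = 1 << M;                       // 8 identifiers
    static final TreeSet<Integer> NODES =
            new TreeSet<>(Arrays.asList(1, 2, 4, 5, 7));

    /** k-th finger entry of node n, k = 1..M. */
    static int finger(int n, int k) {
        return (n + (1 << (k - 1))) % SPACE;
    }

    /** Node responsible for hosting the given identifier (its successor). */
    static int host(int id) {
        Integer h = NODES.ceiling(id);
        return (h != null) ? h : NODES.first();            // wrap around to node 1
    }

    public static void main(String[] args) {
        for (int n : NODES) {
            System.out.printf("node %d: fingers %d %d %d%n",
                    n, finger(n, 1), finger(n, 2), finger(n, 3));
        }
        for (int id = 0; id < SPACE; id++) {
            System.out.printf("identifier %d -> node %d%n", id, host(id));
        }
        // e.g. node 1 has fingers 2, 3, 5 and hosts identifiers 0 and 1,
        // matching the first row of Table I.
    }
}
```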


TABLE III.: Chord space of modulo 2^5 compute nodes

Finger: (n + 2^(k-1)) mod 2^5
Chord Identifier (n)    K=1  K=2  K=3  K=4  K=5    Host Id
0 1 2 4 8 16 0
1 2 3 5 9 17 1
4 5 6 8 12 20 2,3,4
6 7 8 10 14 22 5,6
9 10 11 13 17 25 7,8,9
12 13 14 16 20 28 10,11,12
13 14 15 17 21 29 13
14 15 16 18 22 30 14
15 16 17 19 23 31 15
17 18 19 21 25 1 16,17
19 20 21 23 27 3 18,19
21 22 23 25 29 5 20,21
22 23 24 26 30 6 22
23 24 25 27 31 7 23
24 25 26 28 0 8 24
25 26 27 29 1 9 25
27 28 29 31 3 11 26,27
28 29 30 0 4 12 28
30 31 0 2 6 14 29,30
31 0 1 3 7 15 31


Figure 5: Session Lookup Time for 10,000 Sessions ((a) best and worst case, (b) average case; vertical axis: response time in milliseconds, horizontal axis: Cloud Area Network chord space 2^m for m = 3, 4, 5)

Figure 6: Average Session Lookup Time for 10,000 Sessions ((a) Apache Tomcat, (b) GlassFish Server; vertical axis: response time in milliseconds, horizontal axis: Cloud Area Network chord space 2^m for m = 3, 4, 5)

reclaimed. The results show that CSMC can be effectively utilized in clouds of varied size and numbers of concurrent users. CSMC enables service providers to develop multi-tenant services.

ACKNOWLEDGMENT

This research was fully supported by Microsoft Research Asia.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: A Berkeley view of cloud computing," UC Berkeley Reliable Adaptive Distributed Systems Laboratory, Tech. Rep., 2009.
[2] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," vol. 25. Amsterdam, The Netherlands: Elsevier Science Publishers B. V., June 2009, pp. 599-616.
[3] W. Sun, K. Zhang, S.-K. Chen, X. Zhang, and H. Liang, "Software as a Service: An Integration Perspective," 2009, pp. 558-569.
[4] L.-J. Zhang and Q. Zhou, "CCOA: Cloud computing open architecture," in Web Services, 2009. ICWS 2009. IEEE International Conference on, July 2009, pp. 607-616.
[5] Google web applications for communication and collaborations. [Online]. Available: http://www.google.com/apps
[6] Windows Azure platform. [Online]. Available: http://www.microsoft.com/windowsazure
[7] M. Sato, "Creating next generation cloud computing based network services and the contributions of social cloud operation support system (OSS) to society," in Enabling Technologies: Infrastructures for Collaborative Enterprises, 2009. WETICE '09. 18th IEEE International Workshops on, 29 June-1 July 2009, pp. 52-56.
[8] AWS management console, a web-based interface to manage your services. [Online]. Available: http://aws.amazon.com/console
[9] ASP.NET state management overview. [Online]. Available: http://msdn.microsoft.com/en-us/library/75x4ha6s.aspx
[10] The Apache Tomcat 5.5 servlet/JSP container, clustering/session replication how-to. [Online].


Available: http://tomcat.apache.org/tomcat-5.5-doc/cluster-
howto.html
[11] Sun glassfish enterprise server v3
prelude developers guide. [Online].
Available: http://docs.sun.com/app/docs/doc/820-
4496/beaha?l=jaa=view.
[12] State management overview, vol. Ar-
ticle ID: 307598. [Online]. Available:
http://support.microsoft.com/kb/307598
[13] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek,
and H. Balakrishnan, Chord: A scalable peer-to-
peer lookup service for internet applications, in
Proceedings of the 2001 conference on Applications,
technologies, architectures, and protocols for computer
communications, ser. SIGCOMM 01. New York, NY,
USA: ACM, 2001, pp. 149160. [Online]. Available:
http://doi.acm.org/10.1145/383059.383071
[14] G. Doyen, E. Nataf, and O. Festor, A performance-
oriented management information model for the chord
peer-to-peer framework, in Management of Multimedia
Networks and Services, ser. Lecture Notes in Computer
Science, J. Vicente and D. Hutchison, Eds. Springer
Berlin / Heidelberg, 2004, vol. 3271, pp. 2949.
[15] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Sto-
ica, Wide-area cooperative storage with cfs, SIGOPS
Oper. Syst. Rev., vol. 35, pp. 202215, October 2001.
[16] S. Ajmani, D. E. Clarke, C. hue Moh, and S. Richman,
Conchord: Cooperative sdsi certificate storage and name
resolution, in In First International Workshop on Peer-to-
Peer Systems. Springer-Verlag, 2002, pp. 141154.
[17] A. Binzenhofer, G. Kunzmann, and R. Henjes, Design and
analysis of a scalable algorithm to monitor chord-based
p2p systems at runtime, Concurr. Comput. : Pract. Exper.,
vol. 20, pp. 625641, April 2008. [Online]. Available:
http://portal.acm.org/citation.cfm?id=1358302.1358309
[18] Z. Jingling, X. Yonggang, and L. Qing, Htc-chord: An
improved chord model based on topic-cluster and hierar-
chic layer, in Broadband Network Multimedia Technology,
2009. IC-BNMT 09. 2nd IEEE International Conference
on, oct. 2009, pp. 655 658.
[19] S. Ktari, A. Hecker, and H. Labiod, Power-law chord
architecture in p2p overlays, in Proceedings of the 2008
ACM CoNEXT Conference, ser. CoNEXT 08. New
York, NY, USA: ACM, 2008, pp. 39:139:2. [Online].
Available: http://doi.acm.org/10.1145/1544012.1544051
[20] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter,
P. Leach, and T. Berners-Lee, Hypertext transfer protocol
http/1.1, oct. 1999, pp. 655 658. [Online]. Available:
http://www.ietf.org/rfc/rfc2616.txt
[21] J. J. Garrett, Ajax: A new approach
to web applications. [Online]. Available:
http://www.adaptivepath.com/ideas/essays/archives/000385.php
[22] Open system testing architecture, 2003. [Online].
Available: http://www.opensta.org


A Quick Emergency Response Model for Micro-blog Public Opinion Crisis Based on Text Sentiment Intensity

Mingjun Xin, Hanxiang Wu, Zhihua Niu
School of Computer Engineering and Science, Shanghai University, Shanghai 200072, China
Email: {xinmj, newwhx}@shu.edu.cn; zhniu@staff.shu.edu.cn

Abstract: On the basis of discussing the information spreading mechanism under the Internet environment, we have studied how to build a public opinion monitoring model according to semantic content or text mining in recent years. A micro-blog public opinion corpus named MPO Corpus, based on the content of micro-blog information, has been constructed by our research team as a test data set. This paper proposes a quick emergency response model (QERM) for micro-blog public opinion crises oriented to Mobile Internet services. Firstly, it describes the micro-blog cases and the emergency response plan library using the Web Ontology Language (OWL), which enables transitive logical reasoning among micro-blog subjects, micro-blog cases and emergency plans. Secondly, it proposes an algorithm to calculate the sentiment intensity of micro-blogs at three levels, namely words, sentences and documents, based on the HowNet knowledge base. Thirdly, we study how to update cases under the subjects and the quick response processes for the micro-blog case base. Finally, we design a test experiment which shows the merits of QERM in response time, basically meeting the quick emergency response demand for micro-blog public opinion crises under the Mobile Internet environment. Thus, it will provide more efficient support to the government and related monitoring departments involved with public opinion crises.

Index Terms: public opinion crisis; sentiment intensity; emergency response; Mobile Internet services

I. INTRODUCTION

Micro-blog is a kind of blogging variant arising under the mobile Internet environment in recent years. It gains more and more attention and recognition for its short format and real-time characteristics, and has become an important platform for public opinion expression. Recently, it has become one of the typical applications of the Mobile Internet. In Wikipedia, micro-blog is described as a broadcast medium in the form of blogging that allows users to exchange small elements of content such as short sentences, individual images, or video links [1]. The difference between micro-blog and the traditional blog is that users of micro-blog can make use of web browsers, mobile phones and other network terminals to read and publish text, images, audio and video links and other types of information anywhere at any time. Because the content of a micro-blog is shorter (generally no more than 140 characters or Chinese words), the transmission speed among users is faster, and the expression is also freer.

The Social Blue Book, published in December 2009 by the Chinese Academy of Sciences, considered micro-blog the most lethal carrier of public opinion. The 2010 third-quarter Assessment Analysis Report of China's Response Capacity to Social Public Opinion, published in October 2010 by Shanghai Jiaotong University, claimed that micro-blog was becoming an important channel for enterprises and individuals to respond to public opinion.

In 2010, in the Yihuang self-immolation event caused by demolition in Jiangxi province, the protagonist Zhong Rujiu registered a micro-blog account and published live updates about the incident's development. Many micro-blogs written by Zhong were reposted by net friends and became a hot topic on the micro-blog network. In the "Guo Meimei event", Guo showed off her luxurious life on micro-blog and presented her ID as a business general manager of the China Red Cross, which caused a big uproar on the network and plunged the China Red Cross into a confidence crisis. And during the Japan earthquake in 2011, rumors that the production of sea salt was unhealthy because sea water had been contaminated by nuclear radiation spread over the network, which caused a rush to buy salt.

From the cases above, it can be seen that new challenges are posed to the government in monitoring public opinion trends and discovering public opinion crises. At present, research on micro-blog public opinion in China has just started, and it lacks sophisticated systems and applications. In particular, there is not enough experience and no integrated emergency response framework for handling public opinion crises quickly. On the basis of the research work on the micro-blog services model, the status of public opinion crises in China, and the micro-blog public opinion corpus constructed by our research team, this paper analyzes and studies the quick response model for micro-blog public opinion crises to improve the capacity to respond to public opinion events.
handle out the public opinion events.
text, images, audio and video links and other types of
II. RELATED WORK

Ontological knowledge representation is an explicit description of the concepts in a domain and of the relationships between those concepts. It provides a syntactic and semantic standard for communication between humans and computers, and improves system reliability and knowledge acquisition capacity [2]. The Web Ontology Language (OWL) is part of the W3C's series of web-related and expanding standards and has strong representation and reasoning ability. OWL provides three increasingly expressive sublanguages (OWL Lite, OWL DL and OWL Full) designed for use by specific communities of implementers and users. OWL Lite supports those users primarily needing a classification hierarchy and simple constraint features. OWL DL supports those users who want maximum expressiveness without losing computational completeness and decidability of reasoning systems. OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF, with no computational guarantees.

HowNet, built by Professor Dong Zhendong, is a common-sense knowledge base for Chinese words, which reveals and reflects the relationships among concepts abstracted from Chinese characters and the attributes of those concepts. The crux of the HowNet philosophy is that all matters are in constant motion and are ever changing in a given time and space, with corresponding changes in their attributes [3]. HowNet extracts sememes from about 6000 characters with a bottom-up grouping approach, classified respectively as the event class, the entity class, the attribute or quantity class, and the attribute or quantity value class. An event role is a semantic relation between concepts: it denotes the possible participants and the roles they play in an event. HowNet also describes the entity class by the event roles it plays in certain events. Relations among these concepts mainly include hypernym-hyponym, synonym, antonym, converse, part-whole, attribute-host, material-product, agent-event, patient-event, instrument-event, location-event, time-event, value-attribute, entity-value, event-role and concept co-relation, etc.

Emergency response is an extremely important stage in the process of dealing with emergencies [4][9][10][12]. The result of the response directly influences the number of casualties and the degree of property loss and environmental damage [6][7][8][18]. Wang believes that emergency response relies on the successful execution of one or more contingency plans, often managed by a command and control center [5]. A common approach is to use a decision support system which integrates expert knowledge and emergency response cases based on case reasoning [11][13][15][16][17].

Our research team has been studying micro-blog public opinion. It proposed an approach to calculate the sentiment intensity of micro-blog texts at three levels (words, sentences and documents), and constructed a public opinion corpus from micro-blog content.

In this paper, a quick emergency response model is proposed, consisting of a micro-blog case library, a response plan library and an emergency response handler engine. The rest of this paper is organized as follows. Section 3 describes the micro-blog cases and emergency response plans using OWL. Section 4 introduces an approach to computing micro-blog sentiment intensity. Section 5 details the quick emergency response model. Section 6 analyzes the experimental results and evaluates the performance. Finally, concluding remarks are given in Section 7.

III. OWL-BASED MICRO-BLOG CASES AND RESPONSE PLANS DESCRIPTION

In this paper, ontology is used as the knowledge representation of micro-blogs and subjects, and the micro-blog cases and emergency plans are described with OWL. To enable reasoning between subjects and individuals, the micro-blog ontology consists of a Category class and a Micro-blog class. A one-to-many relation connects the two classes, which means that a micro-blog individual belongs to exactly one subject, while a subject may include many micro-blog individuals.

A. Subject Class Description

In order to clearly describe the logical relationship between micro-blogs and micro-blog subjects, the inheritance of OWL classes is used to define the hierarchical structure of subjects. The Subject class definition includes two aspects: first, according to the content of public opinion, the subjects are classified into political, economic, cultural, social and other as the first-level classification; furthermore, according to the micro-blog text under each first-level classification, different child subject categories are established by extracting keywords from the micro-blog texts. The structure of the subject class is shown in Fig. 1.

[Figure 1. A structure of the Subject class: the top class Category has the exhaustive subclasses Polity, Economy, Society, Culture and Others, each of which contains subject subclasses Subject1 ... SubjectN.]

In Fig. 1, the class Category is the top class, and the classes Polity, Economy, Society, Culture and Others are its subclasses, which represent the different public opinion categories. Each public opinion classification has many different subclasses except the Others class. By default, micro-blogs in the Others class are unclassified, and the reasoner will select blogs in the Others class and classify them into the other public opinion categories.

Each subject class has two data type attributes, start_time and keywords, inherited from its parent class, and the reasoner will decide a blog's category by these two attributes.
Currently, for the convenience of the experiment, the case library consists of only 14 subject subclasses, including 3 subjects with 27 micro-blogs in the polity category, 5 subjects with 51 blogs in the economy category, 6 subjects with 72 blogs in the society category, and 0 subjects in the culture and others categories. The case distribution of the micro-blog case base is shown in Table I.

TABLE I. THE CASE DISTRIBUTION OF THE MICRO-BLOG CASE BASE

  Category   Quantity of subjects   Quantity of blogs
  Polity         3                      27
  Economy        5                      51
  Society        6                      72
  Culture        0                       0
  Others         0                       0
  Total         14                     150

Besides, there are three other attributes: ID for a unique number in the library, opinion_grade for representing the subject's opinion grade, and planID for connecting to the response plan in the plan library. Part of the program description of the Category class is shown as follows.

  <!-- Description of Category class -->
  <owl:Class rdf:ID="Category"/>
  <owl:Class rdf:ID="Polity">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="Category"/>
    </rdfs:subClassOf>
  </owl:Class>
  <!-- Attribute description for start_date -->
  <owl:DatatypeProperty rdf:ID="start_date">
    <rdfs:domain rdf:resource="#Category"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
  </owl:DatatypeProperty>

B. Micro-blog Class Description

The description of micro-blog content is a knowledge representation of micro-blog information. In this paper, a Micro-blog class is defined as a blueprint of a micro-blog, and each real micro-blog text is treated as an instance, or individual, of the Micro-blog class. According to the OWL guide, the concept of an attribute is defined as a binary relation, which can be restricted in a number of ways, such as by its domain and range. The Micro-blog class has many attributes, of both data type and object type, to describe the general facts of blog instances and the relationship with the Category class. Part of the attributes of the Micro-blog class is listed in Table II as follows.

TABLE II. PART OF THE ATTRIBUTES OF THE MICRO-BLOG CLASS

  Attribute Name     Attribute Type   Attribute Description
  reference_from     object type      points at the referenced or reproduced blog
  reference_at       data type        points at the referenced blog's URL
  belong_to          object type      points at the subject ID it belongs to
  blog_ID            data type        the unique index number
  blog_author        data type        micro-blog author
  blog_date          data type        published time
  meta_information   data type        including provider, client type and so on
  blog_keywords      data type        keywords of the micro-blog
  blog_content       data type        micro-blog content

The class Micro-blog includes two object type attributes, reference_from and belong_to, and eight data type attributes: blog_ID, author, date, meta_information, content and so on. The object type attributes reveal relations among instances of the classes Micro-blog and Category: reference_from is used to point at a referenced micro-blog, and belong_to points at the individual of the Category class to which the micro-blog belongs. The data type attribute meta_information includes the micro-blog provider (like sina), the publication client type (like web or mobile) and IP information. As described above, part of the definition program of the class Micro-blog is shown as follows.

1) Namespace definition

  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://www.owl-ontologies.com/.owl#"
    xml:base="http://www.owl-ontologies.com/microblog.owl">

2) Class Micro-blog and attribute definitions

  <!-- Microblog class -->
  <owl:Class rdf:ID="MicroBlog"/>
  <!-- object type attribute: reference_from -->
  <owl:ObjectProperty rdf:ID="reference_from">
    <rdfs:domain rdf:resource="#MicroBlog"/>
    <rdfs:range rdf:resource="#MicroBlog"/>
  </owl:ObjectProperty>
  <!-- data type attribute: date -->
  <owl:DatatypeProperty rdf:ID="date">
    <rdfs:domain rdf:resource="#MicroBlog"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
  </owl:DatatypeProperty>
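To make the class definitions above concrete, the following is a minimal sketch of how a single micro-blog individual could be asserted against this ontology. It assumes Python with the rdflib library and uses hypothetical individual names; the paper does not prescribe a toolkit, so this is only an illustration of the belong_to and reference_from relations described above.

  # Minimal sketch (assumption: Python + rdflib; individual names are hypothetical)
  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import RDF, XSD

  MB = Namespace("http://www.owl-ontologies.com/microblog.owl#")

  g = Graph()
  g.bind("mb", MB)

  subject = MB["Subject_GuoMeimei"]       # hypothetical subject individual under Society
  blog = MB["blog_0001"]                  # hypothetical micro-blog individual

  g.add((blog, RDF.type, MB.MicroBlog))
  g.add((blog, MB.belong_to, subject))                    # object attribute: blog -> subject
  g.add((blog, MB.reference_from, MB["blog_0000"]))       # points at the reproduced blog
  g.add((blog, MB.blog_author, Literal("some_user")))
  g.add((blog, MB.blog_date, Literal("2011-06-21T10:00:00", datatype=XSD.dateTime)))

  print(g.serialize(format="xml"))        # emits RDF/XML similar to the snippets above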
C. Response Plan Ontology Description

The response plan is an important component of the emergency response system. It is a process template that includes formulating and executing one or more disposal options. After analyzing more than 100 sets of emergency plan instances and some reference papers, this paper considers that a plan template consists of the application scope, organizational structure, resources, workflow and other related content.

As discussed above, organization, resource, event and workflow are each described as an entity class. The definition of the class Plan includes attributes like planID for a unique number, planAim for the plan's aim and planPrinciple for the principle of formulating the plan. Some attributes are detailed in Table III.

TABLE III. ATTRIBUTES OF THE PLAN ONTOLOGY

  Class          Name              Type         Description
  Plan           planAim           data type    plan aim
                 planPrinciple     data type    formulating principle
                 organization      object type  organization structure
                 resource          object type  resource need
                 event             object type  event
                 workflow          object type  workflow
  Organization   leader            data type    person directly responsible
                 members           data type    members
  Resource       tag               data type    resource name
                 quantity          data type    quantity
                 status            data type    status, like "ready"
  Event          eventType         data type    event type
                 eventSummary      data type    event summary
                 eventLevel        data type    event level
  Workflow       workFlowTag       data type    task name
                 condition         data type    trigger conditions
                 organization      object type  responsible organization
                 taskDescription   data type    task description
                 status            data type    status
                 nextTask          data type    next task

The response plan ontology consists of the classes Plan, Organization, Resource, Event and Workflow. The organization in the Plan means those who are directly responsible for the execution of the whole plan, and the one in the Workflow means those responsible for the execution of one task. The event levels in the Event class are defined according to the National Accidents Classification Standards as particularly significant (level I), major (level II), large (level III) and general (level IV).

D. CBR-based Reasoning Process

The core principle of case-based reasoning is that when a new issue is encountered, the system first matches the key features of the issue against the case base to find one or more cases most similar to the issue, and secondly reuses the solution of those cases. If the system is not satisfied with the candidate solution to the issue, it modifies the solution to fit the issue, and finally stores the modified case as a new case in the case base as a reference for the next time a new question is encountered. The case-based reasoning in this paper is formalized as the common four-step (R4) process:
1) Retrieve. For the given target micro-blog subject, the system retrieves similar-subject cases from the case base to process it.
2) Reuse. Each case has a planID linking to its response plan. Map the response plan from the retrieved similar-subject cases to the target micro-blog subject. This may involve adapting the solution as needed to fit the new situation.
3) Revise. Having mapped the previous response plan to the target micro-blog subject, analyze the plan with experts' validation and, if necessary, revise it.
4) Retain. After the plan has passed validation, store the final result and experience as a new case-plan in the library.

The remainder of this paper introduces the quick emergency response model (QERM), plan reproduction based on the CBR mechanism, and the reasoning process driven by the QERM model separately in Section 5.
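The R4 cycle above can be summarized in a few lines of code. The sketch below is only illustrative (Python, with hypothetical Case, similarity, expert-validation and revision helpers; the paper does not define these interfaces).

  # Illustrative R4 (retrieve-reuse-revise-retain) loop; all helper names are hypothetical.
  from dataclasses import dataclass

  @dataclass
  class Case:
      keywords: set          # subject keywords
      intensity: float       # public opinion intensity of the subject
      plan: dict             # linked response plan (via planID in the ontology)

  def subject_similarity(a: Case, b: Case) -> float:
      """Toy similarity: keyword overlap, standing in for the paper's semantic similarity."""
      return len(a.keywords & b.keywords) / max(1, len(a.keywords | b.keywords))

  def r4_respond(target: Case, case_base: list, expert_accepts, revise) -> dict:
      # 1) Retrieve: most similar case in the base
      best = max(case_base, key=lambda c: subject_similarity(target, c))
      # 2) Reuse: start from the retrieved case's response plan
      plan = dict(best.plan)
      # 3) Revise: expert validation drives a domain-specific adaptation
      if not expert_accepts(plan, target):
          plan = revise(plan, target)
      # 4) Retain: store the validated plan as a new case
      case_base.append(Case(target.keywords, target.intensity, plan))
      return plan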
IV. SENTIMENT INTENSITY COMPUTATION MODEL

The sentiment intensity computing model oriented to micro-blogs is the foundation of classifying documents by their emotional intensity, and it is also the basis of the public opinion research of the micro-blog information platform. The model proposed in this paper computes the emotional intensity at three levels: words, sentences and documents. The model frame is shown in Fig. 2.

In Fig. 2, at the start, the original documents are pre-processed by word segmentation. Then the program computes the similarity, based on the algorithm in reference [11], between the words in the documents and the HowNet sentiment analysis set in order to set the words' emotional intensity; this is the word-level process. At the sentence level, the program analyzes the relationships between the words that make up phrases, like modifying relationships, parallel relationships, etc.; the sentimental intensity of a sentence is computed based on the relationships of the words and the words' intensity. Finally, the program analyzes the positions of the sentences in the context and assigns each sentence a different weight to calculate the document's intensity. The detailed instructions are given below.

A. Word Emotional Strength

Word emotional intensity computation is based on the HowNet sentiment analysis set, which consists of Chinese and English emotion analysis word sets, including positive and negative evaluation words, positive and negative emotion words, degree-level words and claim words. Because the emotional difference between the evaluation words and the emotion words is not very obvious in this research, the sentiment analysis words are merged into a positive word set and a negative word set in this paper, such as:

Positive words: love and dote, love and esteem, caress, love.
Negative words: sad, pity, grieved, deep sorrow, dump.

The words in the sets are pre-processed: all positive emotional words are given the weight 1 and all negative emotional words the weight -1. The degree-level words in the set do not contain any emotional information but modify the degree of emotional intensity, so this paper gives these words a positive real-number weight between 1 and 10. Then the similarity between the words in the documents and the processed HowNet sentiment analysis set is computed based on the algorithm in reference [11]. The proposed algorithm for computing a word's emotional intensity is listed as follows:

  if the part of speech of word is degree adverb then
      calculate the similarity between word and each word in the HowNet degree-level word set;
      note the biggest similarity sim and the weight weight of the matched word;
      intensity(word) = weight * sim;
  else if the part of speech of word is one of noun, verb, adjective then
      calculate the similarity between word and each word in the HowNet emotional word set;
      note the biggest similarity sim and the weight weight of the matched word;
      intensity(word) = weight * sim;
  else
      intensity(word) = 0;
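As a concrete (though simplified) reading of the word-level procedure, the sketch below assumes a pre-built dictionary mapping lexicon entries to their weights and a word-similarity function such as the one from reference [11]; both are treated as given, and all names are illustrative rather than part of the paper's implementation.

  # Illustrative word-level intensity computation (Python); helper names are assumptions.
  DEGREE_WORDS = {"very": 4.0, "extremely": 8.0}      # degree-level words, weight in [1, 10]
  EMOTION_WORDS = {"love": 1.0, "sad": -1.0}          # positive = 1, negative = -1

  def similarity(w1: str, w2: str) -> float:
      """Stand-in for the HowNet-based similarity of reference [11]."""
      return 1.0 if w1 == w2 else 0.0

  def word_intensity(word: str, pos: str) -> float:
      if pos == "degree_adverb":
          lexicon = DEGREE_WORDS
      elif pos in ("noun", "verb", "adjective"):
          lexicon = EMOTION_WORDS
      else:
          return 0.0
      # pick the lexicon entry with the biggest similarity and keep its weight
      best_word = max(lexicon, key=lambda w: similarity(word, w))
      return lexicon[best_word] * similarity(word, best_word)

  # e.g. word_intensity("sad", "adjective") == -1.0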
B. Sentence Emotional Strength

Words are the basic units of sentences, but sometimes a single word does not accurately reflect the semantics of a sentence, for example:

Sentence 1: The fuel consumption of the Excelle is really high.
Sentence 2: The Etta's cost performance is very high.

Sentence 1 and sentence 2 are both emotional sentences, but the emotional word "high" shows different polarities when it modifies different objects: "high" is derogatory in sentence 1 and a compliment in sentence 2. Therefore, we study the modifying relationship between adjacent words before calculating a sentence's emotional intensity. Some researchers have found that the phrase structures carrying emotional meaning are usually noun, verb, adjective and adverb phrases. The common Chinese phrase types, such as the prejudiced phrase, are shown in Table IV.

TABLE IV. COMMON CHINESE EMOTIONAL PHRASE STRUCTURES

  Grammar structures                                        Examples
  One center word:
    adjective + noun                                        a clever girl
    noun + verb, noun + adjective                           Wang likes
    verb + noun, verb + adjective                           like clean
    noun + "of" + noun                                      the affinity of idols
    degree adverb + adj./adv., adj./adv. + degree adverb    very good
    negative word + adj./verb/adv.                          do not like
  Multiple center words:
    adjective + adjective, noun + noun, verb + verb         bright and smart

To compute the emotional intensity, the paper obeys the following rules:
a) The emotional strength of parallel-structure phrases, such as noun + noun and adjective + adjective, is equal to the sum of the strengths of the individual words.
b) The emotional strength of modified-structure phrases, such as adjective + noun and adjective + adverb, is equal to the product of the constituents, e.g. intensity(adverb) * intensity(adjective).

To facilitate the calculation of a sentence's emotional intensity, two presumptions are made:
a) Each sentence is a single sentence; complex sentences joined by conjunctions are artificially split into two sentences;
b) The similarity based on HowNet is scaled up by a factor of 10.

Based on the analysis of the semantic relations between the words in the phrases and the context relations in the sentences, the sentence emotional intensity algorithm is designed as follows:

  intensity = 0;
  while word1 is not the last word {
      if there is a modifying relationship between word1 and word2 then
          combine word1 and word2 into word;
          intensity(word) += intensity(word1) * intensity(word2);
          word1 = word;
      else
          intensity += intensity(word1) + intensity(word2);
  }

C. Document Emotional Strength

In a document, the relationships between sentences, such as assumption, transition and progression, affect the document's emotional intensity. The topic sentence of a document occupies a central position and has a significant impact on the document's emotional intensity. In this paper, the intensity of a document is calculated using a linear expression which gives each sentence a different weight to reflect its position in the micro-blog text; the topic sentence or central sentence has a higher weight. The calculation follows the formula below, where $\alpha$ and $\beta_i$ are the correlation coefficients:

$$intensity_{doc} = \alpha \cdot intensity_{topic\ sentence} + \beta_1 \cdot intensity_{sentence_1} + \cdots + \beta_n \cdot intensity_{sentence_n} \qquad (1)$$

In the formula above, the more important a sentence's position is in the document, the larger its coefficient. Usually, the coefficient of the topic sentence is set to a float number between 0.5 and 1, and the other sentences' coefficients are set between 0 and 0.5.
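A literal reading of formula (1) is a weighted sum of sentence intensities. The sketch below is only an illustration of that computation; the coefficient values follow the ranges suggested above, and the sentence-level intensity function is assumed to exist already.

  # Weighted document intensity, following formula (1); sentence_intensity() is assumed given.
  def document_intensity(sentences, sentence_intensity, topic_index=0, alpha=0.8, beta=0.3):
      """sentences: list of sentence strings; topic_index marks the topic/central sentence.
      alpha in (0.5, 1) weights the topic sentence, beta in (0, 0.5) the remaining sentences."""
      total = 0.0
      for i, s in enumerate(sentences):
          coeff = alpha if i == topic_index else beta
          total += coeff * sentence_intensity(s)
      return total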
The sentiment intensity computation model is part of the QERM. With the computed intensity, the QERM identifies a micro-blog's public opinion intensity and selects seed cases in the case base. The next section details the QERM and introduces the information flow within it.

V. QUICK EMERGENCY RESPONSE MODEL

Emergency response is an information-sharing process. The QERM works on the R4 model of CBR and is based on the case base and the response plan library. It is driven by topic tracking and approached by OWL reasoning. Topic tracking, case-based reasoning, and the automatic updating of the case and response plan bases compose the response engine. The workflow of the QERM is shown in Fig. 3.

A. Keywords-based Topic Tracking

The purpose of topic tracking based on micro-blog keywords is to make the instances in the subject categories more abundant and to obtain a more accurate response. Its main idea is:
1) Sort the cases in one subject by their sentiment intensity from big to small, and select N micro-blog cases as seed cases;
2) Extract the key attributes of the seed cases;
3) Track the micro-blog at the URL address given by the value of the attribute reference_at, until the address is null;
4) If the publication client is mobile, then get the base station position through the mobile IP stored in the meta_information attribute, and continue tracking similar micro-blog examples around the base station;
5) Handle the blog examples by semantic analysis and sentiment intensity computing;
6) Decide the examples' categories using the reasoner;
7) Update the public opinion intensity of the subject category.

The micro-blog sentiment intensity is computed with the method described in the previous section. The public opinion intensity of one subject category is calculated by the linear addition of each blog's sentiment intensity.
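A simplified rendering of the seed-based tracking idea (steps 1 and 3 above) is sketched below; the case fields mirror the ontology attributes, while the fetch function that resolves a reference_at URL is assumed and not specified by the paper.

  # Illustrative seed selection and reference-chain tracking; fetch_blog() is an assumed helper.
  def select_seed_cases(cases, n):
      """Step 1: keep the N cases with the highest sentiment intensity in one subject."""
      return sorted(cases, key=lambda c: c["intensity"], reverse=True)[:n]

  def track_reference_chain(seed, fetch_blog):
      """Step 3: follow reference_at URLs until the chain ends (address is null)."""
      tracked = []
      url = seed.get("reference_at")
      while url:
          blog = fetch_blog(url)          # retrieve the referenced micro-blog
          tracked.append(blog)
          url = blog.get("reference_at")
      return tracked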
B. Case Retrieval and Plan Reproduction

Case retrieval and reproduction is an important part of a case-reasoning-based emergency response system. In this paper, the subject keywords and the public opinion intensity are used to retrieve the case base:
1) Select the subjects with approximately equal intensity as the optional subjects;
2) Compute the semantic similarity of the keywords between each optional subject and the new subject, and choose the one with the biggest similarity as the final optional subject;
3) Modify the response plan of the optional subject until the modified plan has passed expert validation;
4) Store the passed response plan into the plan base and start the emergency response.

C. The Reasoning Process for QERM

The quick emergency response model is based on the case base and the response plan base. The most important part is the engine, which consists of the topic tracking and the case-retrieval and reproduction subsystems. The topic tracking subsystem tracks new micro-blog examples using the blog cases as seeds, and then updates the case base. The subject class Category is defined as a three-dimensional vector <keywords, intensity, planID>, with which the system starts the emergency response. The detailed steps are described as follows:

  // extract the key attributes of the subject
  while the value of reference_at is not null do
      track the micro-blog at the address given by the value of reference_at
  end
  if the client type is mobileType then
      get the base station position through the mobile IP stored in the meta_information attribute;
      track the similar micro-blog examples around the base station;
  end
  ...
  // compute the semantic similarity of the keywords between the optional subject and the new subject
  if similarity > the threshold TH then
      decide the blog case as the final optional case
  end

The proposed QERM uses the OWL reasoning tool in the processes of topic tracking, case retrieval and plan reuse to implement the updating of the case base and the plan base. It provides a quick response to micro-blog public opinion, which can assist the government and experts in responding to emergency incidents.

[Figure 3. The workflow of QERM: micro-blog cases -> topic tracking based on micro-blog keywords -> describe micro-blog individuals using OWL -> update the public opinion intensity of the subject -> retrieve a similar subject category in the case base (case retrieval, case base) -> reproduce and modify the candidate plan (plan library, expert validation) -> start the emergency response -> update the case base and plan base.]

VI. EXPERIMENT AND RESULTS ANALYSIS

To test the performance of the proposed model, this paper designs a simulation experiment using the case base as the data set and choosing the subjects "Guo MM event", "grab salt incident" and "7.23 high-speed rail event" as the test subjects. The experiment takes the time in which the system obtains a reliable plan as the test result. The results are shown in Fig. 4.

It can be seen from Fig. 4a and Fig. 4b that the proposed QERM has some merits in response time (about 15 minutes). It can meet the quick response demand in micro-blog public opinion emergency events. However, the proposed model has the disadvantage that the performance of topic tracking is not good enough, and this is also future work that we will address.

VII. CONCLUSION

Recently, micro-blog, as a new personal media network service, has become an important channel for people to obtain and publish information and ideas. Around the micro-blog public opinion events discussed in this paper, we have continued to study the quick micro-blog emergency response model (QERM) by using OWL reasoning tools on the basis of our lab's research results. The public opinion intensity of micro-blog subjects is computed by the given sentiment intensity algorithm. The QERM is driven by topic tracking and approached by the OWL reasoning mechanism. Its core engine is composed of CBR-based topic tracking, the case base and the response plan library. The test experiment demonstrates the merits of the quick emergency response model, which also provides better technical support for government monitoring departments to handle emergency public opinion incidents quickly and successfully. In future research work, we will pay more attention to the propagation chain and model to get better results for public opinion monitoring.

ACKNOWLEDGMENT

Our research is supported by the National Natural Science Foundation of China (Project Numbers 61074135 and 60903187), the Shanghai Creative Foundation Project of Educational Development (Project Number 09YZ14), and the Shanghai Leading Academic Discipline Project (Project Number J50103). Great thanks to all of our hard-working fellows in the above projects.

REFERENCES

[1] Wikipedia. http://en.wikipedia.org/wiki/Micro-blog
[2] Web Ontology Language Guide. http://www.w3.org/TR/2004/REC-owl-guide-20040210/
[3] HowNet's Home Page. http://www.keenage.com
[4] Zhang Zimin, Zhou Ying, Mao Xi. Emergency Response Information Model Based on Information Sharing (Part I): Model Definition. China Safety Science Journal, Vol. 20, No. 8, pp. 154-160, Aug. 2010.
[5] Wang Wenjun, Meng Fankuo. Research on Ontology-based Emergency Response Plan Template. Computer Engineering, Vol. 32, No. 19, pp. 170-172, October 2006.
[6] Jennex M. E. Modeling emergency response systems. 40th Annual Hawaii International Conference on System Sciences, Hawaii, USA, 2007: 22-29.
[7] Mendonca D, Beroggi GEG, Wallace WA. Evaluating Support for Improvisation in Simulated Emergency Scenarios. Proc. of the HICSS, 2003.
[8] Dyer D, Cross S. Planning with Templates. IEEE Intelligent Systems, 2005, 20(2).
[9] I. M. Dokas, D. A. Karras, D. C. Panagiotakopoulos. Fault tree analysis and fuzzy expert systems: Early warning and emergency response of landfill operations. Environmental Modelling & Software, 2009, 24(1): 8-25.
[10] Liao Zhenliang, Liu Yanhui. Emergency plan system for pollution incident emergency response on the basis of case-based reasoning. Environmental Pollution and Control, 2009, 31(1): 86-89.
[11] Wang Wenjun, Zhang Xiankun. Emergency Response Organization Ontology Model and Its Application. Second International Symposium on Information Science and Engineering.
[12] Li Hua, Zhu Xianmin, Zhao Daozhi. Research on SUMO-based Emergency Response Management Team Model. Wireless Communications, Networking and Mobile Computing, 2007 (WiCom 2007), International Conference on, 21-25 Sept. 2007, pp. 4606-4609.
[13] Han Fuyou, Zhang Hailong, Dong Liyan. Research on Evaluation Model of Emergency Response Plans. Proceedings of the 2009 IEEE International Conference on Mechatronics and Automation, August 9-12, Changchun, China.
[14] Lei Ji, Hong Chi, An Chen. Emergency Management. Higher Education Press, Beijing, 2006. (in Chinese)
[15] Wallace W.A., DeBalogh F. Decision support systems for disaster management. Public Administration Review 45, 1985, pp. 134-146.
[16] Perez A. G. and Benjamins V. R. Overview of Knowledge Sharing and Reuse Components: Ontologies and Problem Solving Methods. Proceedings of the IJCAI99 Workshop on Ontologies and Problem Solving Methods (KRR5), 1999, 1-15.
[17] P. Kruchten, C. Woo, K. Monu and M. Sotoodeh. A Human-Centered Conceptual Model of Disasters Affecting Critical Infrastructures. ISCRAM 2007, Netherlands, May 13-16, 2007.
[18] Shuren Bai, Peng Du. An Organization Model based on Party Pattern to Support Dynamic Change for Role-based Workflow Applications. Proceedings of the IEEE Workshop on Distributed Intelligent Systems: Collective Intelligence and Its Applications (DIS'06).

Xin Mingjun was born in 1970. He received the PhD degree in computer science from Northwestern Polytechnical University, China. He is currently an associate professor in the School of Computer Engineering and Science at Shanghai University, China. His research interests include service computing, decision support systems and information systems.

Wu Hanxiang was born in 1985. He earned a B.S. degree in computer science and technology in 2009 from Tianjin University of Technology and Education. He is currently a postgraduate student in the School of Computer Engineering and Science at Shanghai University, China. His research interests include web services security, content audit and public opinion monitoring.

Niu Zhihua was born in 1976. She received her Ph.D. degree at Xidian University. She is now a lecturer at the School of Computer Engineering and Science, Shanghai University. Her main research fields are cryptography and information security.
A New Text Clustering Method Based on KSEP

ZhanGang Hao
Shandong Institute of Business and Technology, Yantai, China
Email: zghao2000@hotmail.com

Manuscript received September 30, 2011; revised November 20, 2011; accepted November 26, 2011.

Abstract: Text clustering is one of the key research areas in data mining. The k-medoids algorithm is a classical partitioning algorithm that can handle isolated points, but it often converges to a local optimum. This article presents an improved social evolutionary programming algorithm (K-medoids Social Evolutionary Programming, KSEP). The algorithm takes the k-medoids algorithm as the main cognitive reasoning algorithm, and improves the learning of paradigms, the strengthening and attenuation of the optimal paradigm, and the cognitive agents' betrayal of the paradigm. This algorithm increases the diversity of the population and enhances the optimization capability of social evolutionary programming, thus improving the accuracy of clustering and the capacity of acquiring isolated points.

Index Terms: text clustering, K-medoids algorithm, social evolutionary programming

I. INTRODUCTION

Several text clustering methods already exist. The k-means algorithm and the k-medoids algorithm are efficient and able to handle large text collections effectively, but they generally converge to a local minimum, and it is difficult to guarantee the global minimum. Other text clustering algorithms have been proposed, for example the SKM algorithm, the WAP algorithm and others [5-11]. Most of these algorithms can solve the text clustering problem efficiently. However, these algorithms are weak at finding isolated points in the results.

Social evolutionary programming (SEP) is a global search algorithm based on paradigm conversion [1-2]; it has been used to solve clustering problems [3-4], but it has not solved the problem of isolated points. This article presents an improved social evolutionary programming algorithm (K-medoids Social Evolutionary Programming, KSEP). In this algorithm, the k-medoids algorithm serves as the cognitive subject's cognitive reasoning algorithm; a new way for cognitive agents to learn paradigms during clustering is proposed; and a new formula for strengthening and attenuating the optimal paradigm is proposed. This algorithm increases the diversity of the population and enhances the optimization capability of social evolutionary programming, thus improving the accuracy of clustering and the capacity of acquiring isolated points.

II. LITERATURE REVIEW

In the past few years, text clustering has been studied by several researchers. Xu Sen et al. proposed spectral clustering algorithms for the document cluster ensemble problem. In that paper, two spectral clustering algorithms were applied to the document cluster ensemble problem. To make the algorithms extensible to large-scale applications, the large-scale matrix eigenvalue decomposition was avoided by solving the eigenvalue decomposition of two induced small matrices, and thus the computational complexity of the algorithms was effectively reduced. Experiments on real-world document sets show that the algebraic transformation method is feasible, as it can effectively increase the efficiency of spectral algorithms; both of the proposed cluster ensemble spectral algorithms are more effective and efficient than other common cluster ensemble techniques, and they provide a good way to solve the document cluster ensemble problem [5].

Dhillon I. S. et al. proposed the SKM (spherical K-means) algorithm, which has been proved to be very efficient. However, SKM is a gradient-based algorithm, and the objective function with respect to the concept vectors in R^d is not a strictly concave function over the space. Therefore, different initial values converge to different local minima, so the algorithm is very unstable [6].

Guan Renchu et al. proposed the WAP (weight affinity propagation) algorithm. Affinity propagation (AP) is a newly developed and effective clustering algorithm. For its simplicity, general applicability, and good performance, AP has been used in many data mining research fields. In AP implementations, the similarity measurement plays an important role. Conventionally, text mining is based on the whole vector space model (VSM), and its similarity measurements often fall into Euclidean space. Clustering texts in this way has the advantage of being simple and easy to perform. However, when the data scale puffs up, the vector space becomes high-dimensional and sparse, and the computational complexity grows exponentially. To overcome this difficulty, a non-Euclidean space similarity measurement was proposed based on the definitions of the similar feature set (SFS), the rejective feature set (RFS) and the arbitral feature set (AFS). The new similarity measurement not only breaks out of the Euclidean space constraint, but also contains the structural information of documents. Therefore, a novel clustering algorithm, named weight affinity propagation (WAP), was developed by combining the new similarity measurement and AP. In addition, as a benchmark dataset, Reuters-21578 was used to test the proposed algorithm. Experimental results show that the proposed method is superior to the classical k-means, traditional SOFM and affinity propagation with the classic similarity measurement [7].
Peng Jing et al. proposed a novel text clustering algorithm based on an inner product space model of semantics. Due to the lack of consideration of the latent similarity information among words, the clustering results of existing clustering algorithms when processing text data, especially short text data, are not ideal. Considering the text characteristics of high dimensionality and sparse space, that paper proposes a novel text clustering algorithm based on a semantic inner space model. The paper first creates a similarity method among Chinese concepts, words and texts based on the definition of the inner space, and then analyzes the algorithm systematically in theory. Through a two-phase process, i.e. a top-down "divide" phase and a bottom-up "merge" phase, it finishes the clustering of text data. The method has been applied to the data clustering of Chinese short documents. Extensive experiments show that the method is better than traditional algorithms [8].

In addition, Hamerly G. [9], Wagstaff K. [10], Tao Li [11], G. Forestier [13], Wen Zhang [14], Linghui Gong [15] and Argyris Kalogeratos [16] have also proposed text clustering methods. However, these methods do not effectively solve the problem of isolated points. Therefore, this article presents an improved social evolutionary programming algorithm (K-medoids Social Evolutionary Programming, KSEP). Compared with the k-means algorithm, the KSEP algorithm not only better solves the problem of isolated points but is also able to find the global optimum. Compared with the k-medoids algorithm, it searches isolated points better and is able to find the global optimum. With the new algorithm, the problem of isolated points can be solved not only efficiently but also more effectively.

III. CHARACTERISTIC DENOTATION OF TEXT

A Chinese text categorization model first segments the Chinese text group into words and vectorizes it, forming a characteristic group, followed by the extraction of a most optimal characteristic subgroup from all characteristic groups using a characteristic extraction algorithm according to a characteristics evaluation function.

Chinese text is transformed from non-structural data into structural data by treating the participles, using the text vector space model. The basic idea of VSM can be explained in this way: each article in the text group is denoted as a vector in a high-dimensional space according to a predefined vocabulary order. A word in the predefined vocabulary order is viewed as a dimension of the vector space, and the weight of the word is viewed as the value of the vector in that dimension; consequently, the article is denoted as a vector in a high-dimensional space. The advantage of VSM is that it is simple, not demanding on semantic knowledge and easy to calculate.

This model defines the text space as a vector space composed of orthogonal word vectors. Each text d is denoted as a normalized characteristic vector V(d) = (t1, w1(d); ...; ti, wi(d); ...; tn, wn(d)), where ti is a characteristic word in text d and wi(d) is the weight of ti in d; V(d) is called the vector space expression of text d. The weight wi(d) is a function of the term frequency tf_i(d) and usually uses the TFIDF function, which has many formulas in actual application. The one used by this paper is

$$w_i(d) = \frac{(\log(tf_i) + 1.0)\,\log(N/n_i)}{\sqrt{\sum_{i=1}^{l}\big[(\log(tf_i) + 1.0)\,\log(N/n_i)\big]^2}} \qquad (1)$$

In the formula, tf_i is the frequency of the characteristic word ti in text d, N is the total number of texts in the text group, n_i is the number of texts in the text group that contain the characteristic word ti, and l is the number of characteristic words in text d.
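Formula (1) is the familiar length-normalized TF-IDF weighting. The sketch below is a straightforward (not optimized) rendering of it in code; tokenization is assumed to have been done already, so documents are simply lists of words.

  # Length-normalized TF-IDF vector for one document, following formula (1).
  import math
  from collections import Counter

  def tfidf_vector(doc_words, corpus):
      """doc_words: list of words of one text d; corpus: list of word-lists (the text group)."""
      N = len(corpus)
      tf = Counter(doc_words)
      # n_i: number of texts containing each characteristic word of d
      n = {t: sum(1 for other in corpus if t in other) for t in tf}
      raw = {t: (math.log(tf[t]) + 1.0) * math.log(N / n[t]) for t in tf}
      norm = math.sqrt(sum(v * v for v in raw.values()))
      return {t: (v / norm if norm > 0 else 0.0) for t, v in raw.items()}

  # Example: tfidf_vector(["price", "high", "price"], [["price", "high", "price"], ["service", "good"]])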
IV. TEXT CLUSTERING METHOD BASED ON KSEP

A. K-medoids-based Cognitive Reasoning Algorithm

In order to enhance the ability of the algorithm to find outliers, this algorithm takes the k-medoids algorithm as the individual's cognitive reasoning algorithm.

The operating principle of the k-medoids algorithm is as follows. The primary idea of the k-medoids algorithm is that it first sets a random representative object for each cluster to form k clusters of the n data. Then, according to the principle of minimum distance, the other data are distributed to the corresponding clusters according to their distance from the representative objects. An old cluster representative object is replaced with a new one if the replacement can improve the clustering quality. A cost function is used to evaluate whether the clustering quality has been improved [12]. The function is as follows:

  ΔE = E2 - E1      (2)

where ΔE denotes the change of the mean square error, E2 denotes the sum of the mean square error after the old representative object is replaced with the new one, and E1 denotes the sum of the mean square error before the old representative object is replaced with the new one.

[Figure 1. The k-medoids algorithm clustering process.]

The k-medoids clustering algorithm follows four main processing steps, as shown in Fig. 1.
If ΔE is negative, it means that the clustering quality is improved and the old representative object should be replaced with the new one; otherwise, the old one is kept.

The procedure of the k-medoids algorithm is as follows:
(1) Choose k random objects from the n data as the initial cluster representative objects;
(2) Repeat steps (3) to (5) until no cluster changes;
(3) According to the distance (generally the Euclidean distance) between each datum and the corresponding cluster representative object, and according to the minimal distance principle, distribute each datum to the corresponding cluster;
(4) Randomly choose a non-representative object O_random and calculate the cost ΔE of exchanging it with the chosen representative object O_j;
(5) If ΔE is negative, replace O_j with O_random.
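The procedure above can be written compactly. The following is a small, illustrative k-medoids implementation in Python; the cost is computed as the total within-cluster distance, a common simplification of the ΔE criterion above, so this is a sketch rather than the paper's exact implementation.

  # Small illustrative k-medoids clustering; assumes points are tuples of floats.
  import math, random

  def dist(a, b):
      return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

  def total_cost(points, medoids):
      # sum of each point's distance to its nearest medoid (stands in for the E terms)
      return sum(min(dist(p, m) for m in medoids) for p in points)

  def k_medoids(points, k, max_iter=100):
      medoids = random.sample(points, k)                      # step (1)
      for _ in range(max_iter):                               # step (2)
          improved = False
          for m in list(medoids):                             # steps (4)-(5): try swaps
              for candidate in points:
                  if candidate in medoids:
                      continue
                  trial = [candidate if x == m else x for x in medoids]
                  if total_cost(points, trial) < total_cost(points, medoids):   # ΔE < 0
                      medoids, improved = trial, True
          if not improved:
              break
      # step (3): final assignment of each datum to its nearest medoid
      clusters = {m: [] for m in medoids}
      for p in points:
          clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
      return clusters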
B. Evolved Self-optimization Process of Paradigm-based Learning and Updating

A good paradigm is a record of a good viable solution. Here, F stands for a paradigm, M for the number of paradigms, and F[i] for the i-th paradigm (i = 1, 2, ..., M). The M paradigms are arranged according to the objective function value f(F[i]) in ascending sequence, as shown below:

  f(F[1]) ≤ f(F[2]) ≤ ... ≤ f(F[M-1]) ≤ f(F[M]).

A series of individuals is obtained through the application of the k-medoids algorithm. Once a new individual F[l] is obtained, it is inserted into the proper position of the M paradigms arranged in ascending order of objective function value if its objective function value is smaller than that of some existing paradigm, i.e. for j ∈ (1, M), if f(F[j-1]) < f(F[l]) < f(F[j]), then F[j] = F[l], F[j+1] = F[j], ..., F[M] = F[M-1]. In this way, throughout the entire evolving process, the M paradigms are constantly in a dynamically updating status.

C. Learning Paradigm Form of Cognitive Agents in a Cluster

A new paradigm produced by the cognitive agents of generation k should refer to the paradigms of generation k-1. In the k-th generation, a paradigm is randomly selected from the first 1/3 of the sorted paradigms (because the function values are sorted from small to large, the first 1/3 of the paradigms are the best part of all); that is, if there are M paradigms, one of the M/3 (rounded up) paradigms is randomly selected. In this paradigm, the category h (h ∈ (1, c), where c is the number of clusters) that datum i (i ∈ (1, n), where n is the number of data to be clustered) belongs to can be explicitly displayed. It is believable that the data near the clustering center p̄ (p̄ is the mean value of category h) under category h embody better cognitive behavior of the agents, so that the heritage of such behavior should proceed from the k-th individual; i.e., each category under category h reserves certain data (according to a preset ratio, e.g. ratio = 0.45 as assigned in this article, rounding upward). The reserved data still form c categories, and the rest of the data are allocated to these categories according to their similarity (the Euclidean distance is used in this article) to the clustering center (the mean of all data under the category), so as to complete the heritage and generate a new paradigm.

D. Optimal Paradigm Strengthening and Attenuation

To strengthen the local self-optimizing ability of SEP, the learning probability p1 of the currently most optimal paradigm F[1] may be artificially enlarged. Meanwhile, the p1 value should also be attenuated step by step in order to prevent the entire social population from converging toward the most optimal paradigm, which would reduce the global self-optimization capability. The specifics are illustrated as follows.

If a new currently most optimal paradigm F[1] is generated in generation k, then in the clustering process of generation k+1, the probability of learning paradigm F[1] is designated as p1, p1 ∈ (0, 1), and the probability pi (i = 2, 3, ..., M) of learning the other paradigms is

$$p_i = \frac{1/f(F[i])}{\sum_{i=2}^{M} 1/f(F[i])}\,(1 - p_1) \qquad (3)$$

In general, the closer the algorithm is to the later stages of evolution, the closer it is to the optimal solution. In order not to destroy the optimal solution, the learning probability p1 of the optimal paradigm is given a relatively small value in the early period (such as the first half of the cycle) and is later re-assigned a higher value. This keeps a good paradigm and also increases the diversity of the population.

In the clustering process of each generation between generation k+2 and generation k+t (supposing the currently most optimal paradigm is renewed once more in generation k+t), the probability p1 of paradigm F[1] being learned by the other paradigms is, in turn:
$$p_1^{k+i} = \begin{cases} p_1^{k}\,\dfrac{100 - \lambda_1^{(i-1)}}{100}, & \text{if } u \le \dfrac{t-2}{3} \\[2mm] p_1^{k}\,\dfrac{100 - \lambda_2^{(i-1)}}{100}, & \text{if } \dfrac{t-2}{3} < u < \dfrac{2(t-2)}{3} \\[2mm] p_1^{k}\,\dfrac{100 - \lambda_3^{(i-1)}}{100}, & \text{if } u \ge \dfrac{2(t-2)}{3} \end{cases} \qquad (4)$$

In which i ∈ (2, 3, ..., t), and the parameters λ1, λ2, λ3 control the attenuation rate; their superscript (i-1) is the power. The smaller λ1, λ2, λ3 are, the slower the attenuation. In general, λ1 ∈ (2.5, 3), λ2 ∈ (1.5, 2.5), λ3 ∈ (1, 1.5). The genetic (learning) rates pi, i ∈ (2, 3, ..., M), of the other paradigms are still computed with Eq. (3).
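Read literally, Eq. (3) spreads the remaining probability mass 1 - p1 over the other paradigms in inverse proportion to their objective values, and Eq. (4) decays p1 geometrically faster depending on which third of the update window applies. The following sketch shows one way this could be computed; the parameter defaults and the interpretation of u as the position inside the k+2..k+t window are assumptions made for illustration.

  # Illustrative computation of the paradigm learning probabilities (Eqs. (3) and (4)).
  def other_probabilities(f_values, p1):
      """Eq. (3): f_values are f(F[2])..f(F[M]); returns p_2..p_M."""
      inv = [1.0 / f for f in f_values]
      s = sum(inv)
      return [(v / s) * (1.0 - p1) for v in inv]

  def attenuated_p1(p1_k, i, t, u, lam=(2.75, 2.0, 1.25)):
      """Eq. (4): attenuate p1 from generation k, step i in 2..t.
      u is assumed to index the position within the k+2..k+t window."""
      if u <= (t - 2) / 3:
          lam_sel = lam[0]
      elif u < 2 * (t - 2) / 3:
          lam_sel = lam[1]
      else:
          lam_sel = lam[2]
      return p1_k * (100 - lam_sel ** (i - 1)) / 100

  # e.g. attenuated_p1(0.6, i=3, t=10, u=1) -> 0.6 * (100 - 2.75**2) / 100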
E. Cognitive Agents' Betrayal of the Paradigm

1) Assume a cognitive agent mutation probability threshold, which is used to determine whether a certain cognitive agent bears the nature of betrayal, and a behavioral mutation probability threshold, which is used to determine at which point or points of its specific behavior in the entire process a cognitive agent inclined to betray actually falls into betrayal.
2) Prior to the cognition of each cognitive agent, a random number is given by an evenly distributed generator. If it is not greater than the mutation threshold, the agent is considered not to have the betrayal nature, and its behavioral process rigorously follows the cognitive agent learning paradigm form to complete the genetic process mentioned above; otherwise, this agent has the nature of betrayal and processing continues with step 3).
3) If the cognitive agent is identified as bearing the betrayal nature, a random number is assigned by the evenly distributed generator. If the random number is not greater than the behavioral mutation rate, the behavior does not belong to betrayal behavior and follows the existing paradigm genetic form as described in the cognitive agent learning paradigm; otherwise, this agent betrays, and a chaotic mutation operator is applied to produce a new individual.

The two thresholds are given larger values in the early stage and relatively small values in the latter part of the process.

V. EXPERIMENTAL ANALYSIS

This paper picks 505 articles in 6 categories from CQVIP as experiment data. The first 5 categories contain 100 articles each and the last category contains 5 articles, which form the isolated points. The first 500 articles for the experiment are sourced from http://dlib.cnki.net/kns50/. The 5 categories are industrial economy (IE), cultural economy (CE), market research and information (MRI), management (M) and service economy (SE), respectively. The last category is current affairs and news (CAN), sourced from http://www.baidu.com/. After basic treatment and dimension reduction of these files, the k-medoids algorithm and the KSEP algorithm are used for clustering analysis.

A. Experiment 1

First, the k-medoids algorithm is used for clustering analysis. The results are shown in Table 1.

TABLE 1. RESULTS FROM THE K-MEDOIDS ALGORITHM

                               IE    CE    MRI   M     SE    CAN
  Wrong articles               59    60    55    52    57    1
  Correct articles             41    40    45    48    43    4
  Percentage of correct ones   41    40    45    48    43    80
  Time (seconds)               32.5

As can be seen from the above experiment, when the k-medoids algorithm is used for text clustering, the time is very short and it is very efficient, and it also identifies isolated points fairly well. However, the clustering results are not satisfactory: the clustering accuracy is very low.

B. Experiment 2

Then, the KSEP algorithm is used for clustering analysis. The results are shown in Table 2.

TABLE 2. RESULTS FROM THE KSEP ALGORITHM

                               IE    CE    MRI   M     SE    CAN
  Wrong articles               9     11    8     7     9     0
  Correct articles             91    89    92    93    91    5
  Percentage of correct ones   91    89    92    93    91    100
  Time (seconds)               1893

As can be seen from Experiment 2, although the KSEP algorithm presented in this paper takes much more time, the clustering effect is very good. As can be seen from Table 2, the number of wrongly clustered articles is significantly reduced, the number of correctly clustered articles is significantly increased, and the isolated points are also identified well.

VI. SUMMARY

Text clustering is widely used in the real world and is an important subject for data mining. The k-medoids algorithm is a classical clustering algorithm, but its accuracy is low. This paper embeds the k-medoids algorithm into social evolutionary programming and improves the learning of paradigms, the strengthening and attenuation of the optimal paradigm, and the cognitive agents' betrayal of the paradigm. This algorithm increases the diversity of the population and enhances the optimization capability of social evolutionary programming, thus improving the accuracy of clustering and the capacity of acquiring isolated points.
ACKNOWLEDGEMENTS

This paper is supported by the National Natural Science Foundation of China (Grant No. 70971077), the Shandong Province Doctoral Foundation (2008BS01028), and the Natural Science Foundation of Shandong Province (Grant No. ZR2009HQ005, ZR2009HM008).

REFERENCES

[1] Yu Yixin, Zhang Hongpeng. A social cognition model applied to general combination optimization problem. Proceedings of the First International Conference on Machine Learning and Cybernetics, November 4-5, 2002, Beijing, China, 1208-1213.
[2] Sebastien Picault, Anne Collinot. Designing Social Cognition Models for Multi-Agent Systems through Simulating Primate Societies. Proceedings of ICMAS'98 (3rd International Conference on Multi-Agent Systems), 1998, 238-245.
[3] Hao Zhangang. Building Text Knowledge Map for Product Development based on CSEP Method. 2009 International Conference on Computer Network and Multimedia Technology, 2009, 12: 1081-1085.
[4] Hao Zhangang, Yang Jianhua. Building Knowledge Map for Product Development based on GAKME Method. The Second International Workshop on Education Technology and Computer Science, 2010, 3: 696-699.
[5] Xu Sen, Lu Zhi-mao, Gu Guo-chang. Spectral clustering algorithms for document cluster ensemble problem. Journal on Communications, 2010, Vol. 31, No. 6, 58-66.
[6] Dhillon I S, Modha D S. Concept decompositions for large sparse text data using clustering. Machine Learning, 2001, 42(1-2): 143-175.
[7] Guan Renchu, Pei Zhili, Shi Xiaohu, Yang Chen, Liang Yanchun. Weight Affinity Propagation and Its Application to Text Clustering. Journal of Computer Research and Development, 2010, 47(10), 1733-1740.
[8] Peng Jing, Yang Dong-Qin, Tang Shi-Wei, Fu Yan, Jiang Han-Kui. A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantics. Chinese Journal of Computers, 2007, 30(8), 1354-1362.
[9] Hamerly G, Elkan C. Learning the k in k-means. Proceedings of the 17th Annual Conference on Neural Information Processing Systems (NIPS), 2003, 281-289.
[10] Wagstaff K, Cardie C, Rogers S, Schroedl S. Constrained K-means clustering with background knowledge. In Brodley CE, Danyluk AP, eds. Proc. of the 18th Int'l Conf. on Machine Learning. Morgan Kaufmann Publishers, 2001, 577-584.
[11] Tao Li. Document clustering via Adaptive Subspace Iteration. In Proceedings of the 12th ACM International Conference on Multimedia. New York: ACM, 2004, 364-367.
[12] Zhu Ming. Data Mining. Hefei: China Science and Technology University Press, 2002, 129-164.
[13] G. Forestier, P. Gancarski, C. Wemmert. Collaborative clustering with background knowledge. Data & Knowledge Engineering, 2010, 69(02): 211-228.
[14] Wen Zhang, Taketoshi Yoshida, Xijin Tang, Qing Wang. Text clustering using frequent itemsets. Knowledge-Based Systems, 2010, 23(5), 379-388.
[15] Linghui Gong, Jianping Zeng, Shiyong Zhang. Text stream clustering algorithm based on adaptive feature selection. Expert Systems with Applications, 2011, 38(3), 1393-1399.
[16] Argyris Kalogeratos, Aristidis Likas. Document clustering using synthetic cluster prototypes. Data & Knowledge Engineering, 2011, 70(3), 284-306.

ZhanGang Hao was born in March 1976. He obtained a PhD in Management from Tianjin University in 2006. His research areas include text mining, knowledge management and evolutionary algorithms. He is an Associate Professor at Shandong Institute of Business and Technology in Yantai, Shandong province.
Call for Papers and Special Issues

Aims and Scope.


Journal of Software (JSW, ISSN 1796-217X) is a scholarly peer-reviewed international scientific journal focusing on theories, methods, and
applications in software. It provides a high-profile, leading-edge forum for academic researchers, industrial professionals, engineers, consultants,
managers, educators and policy makers working in the field to contribute and disseminate innovative new work on software.

We are interested in well-defined theoretical results and empirical studies that have potential impact on the construction, analysis, or management
of software. The scope of this Journal ranges from the mechanisms through the development of principles to the application of those principles to
specific environments. JSW invites original, previously unpublished, research, survey and tutorial papers, plus case studies and short research notes,
on both applied and theoretical aspects of software. Topics of interest include, but are not restricted to:
Software Requirements Engineering, Architectures and Design, Development and Maintenance, Project Management,
Software Testing, Diagnosis, and Validation, Software Analysis, Assessment, and Evaluation, Theory and Formal Methods
Design and Analysis of Algorithms, Human-Computer Interaction, Software Processes and Workflows
Reverse Engineering and Software Maintenance, Aspect-Orientation and Feature Interaction, Object-Oriented Technology
Component-Based Software Engineering, Computer-Supported Cooperative Work, Agent-Based Software Systems, Middleware Techniques
AI and Knowledge Based Software Engineering, Empirical Software Engineering and Metrics
Software Security, Safety and Reliability, Distribution and Parallelism, Databases
Software Economics, Policy and Ethics, Tools and Development Environments, Programming Languages and Software Engineering
Mobile and Ubiquitous Computing, Embedded and Real-time Software, Database, Data Mining, and Data Warehousing
Internet and Information Systems Development, Web-Based Tools, Systems, and Environments, State-Of-The-Art Survey

Special Issue Guidelines


Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by
invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal.
Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the
readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages of length.

The following information should be included as part of the proposal:


Proposed title for the Special Issue
Description of the topic area to be focused upon and justification
Review process for the selection and rejection of papers
Name, contact, position, affiliation, and biography of the Guest Editor(s)
List of potential reviewers
Potential authors to the issue
Tentative time-table for the call for papers and reviews

If a proposal is accepted, the guest editor will be responsible for:


Preparing the Call for Papers to be included on the Journal's Web site.
Distributing the Call for Papers broadly to various mailing lists and sites.
Getting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be
informed of the Instructions for Authors.
Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact
information.
Writing a one- or two-page introductory editorial to be published in the Special Issue.

Special Issue for a Conference/Workshop


A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop, such as the
general chairs and/or program chairs, who are appointed as the Guest Editors of the Special Issue. A Special Issue for a Conference/Workshop
typically consists of 10 to 15 papers, each 8 to 12 pages in length.

Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:
Selecting a Title for the Special Issue, e.g. "Special Issue: Selected Best Papers of XYZ Conference".
Sending us a formal Letter of Intent for the Special Issue.
Creating a Call for Papers for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees.
Information about the Journal and Academy Publisher can be included in the Call for Papers.
Establishing criteria for paper selection/rejection. Papers can be nominated based on multiple criteria, e.g. rank in the review process plus
the evaluation from the Session Chairs and feedback from the Conference attendees.
Selecting and inviting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors.
Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.
Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact
information.
Writing a one- or two-page introductory editorial to be published in the Special Issue.

More information is available on the web site at http://www.academypublisher.com/jsw/.


A Quick Emergency Response Model for Micro-blog Public Opinion Crisis Based on Text Sentiment 1413
Intensity
Mingjun Xin, Hanxiang Wu, and Zhihua Niu

A New Text Clustering Method Based on KSEP 1421
ZhanGang Hao
