ABSTRACT
KEYWORDS
Narrative annual report, fraud detection, decision support, support vector machine, queen genetic algorithm
1 INTRODUCTION
Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015
[Figure: framework flowchart. Source of reports: Enterprises; Download from the Market Observation Post System of the Taiwan Stock Exchange; Securities Crime Sentence; Bounced Check of Chairman of the Board. Non-fraudulent branch: Non-Fraudulent Narrative Annual Reports; Data Preprocessing (CKIP System); Finance and Accounting Corpus; Term-Pair Combination; Keyword Filtering; Non-Fraudulent Term Library. Fraudulent branch: Fraudulent Narrative Annual Reports; Data Preprocessing (CKIP System); Term-Pair Combination; Filtering of Fraudulent Feature Terms; Fraudulent Feature Term Library. Final stage: Narrative Annual Report Clustering, with Fraudulent and Non-Fraudulent Narrative Annual Reports as Training Samples.]
[Figure: data preprocessing flowchart. Inputs: Non-Fraudulent Narrative Annual Reports; Fraudulent Narrative Annual Reports; Finance and Accounting Corpus; Stop-Words List; Punctuations List. Steps and decisions: CKIP System; W_j = FT_i? (YES/NO); POS(W_j) = Stop-Word(ST_i)? (YES/NO); POS(W_j) Removal; Punctuation (P_m) Filtering; POS(W_j) = Punctuation(P_m)? (YES/NO); W_{j+1} = ST_n? (YES/NO); Term Combination CT_p = W_j + W_{j+1}; Terms from Non-Fraudulent Annual Reports; All Terms (W_j) Matching? (YES/NO); END.]
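The preprocessing flow above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the text has already been segmented into (word, part-of-speech) pairs, which in the paper is the role of the CKIP system, and the function name `preprocess`, the `sep` parameter, and the sample stop-word and punctuation lists are hypothetical. Following the flowchart labels, stop words and punctuation are dropped, the part-of-speech tags are removed, and two neighbouring words are combined as CT_p = W_j + W_{j+1} only when neither of them was filtered out.

```python
def preprocess(tokens, stop_words, punctuation, sep=" "):
    """Return (kept_terms, term_pairs) for one segmented document.

    tokens: list of (word, pos) tuples from a word segmenter (assumed input).
    stop_words, punctuation: sets of strings to filter out.
    sep: string placed between the two words of a pair (use "" for Chinese).
    """
    def is_noise(word):
        return word in stop_words or word in punctuation

    # Drop stop words and punctuation, and discard the POS tags.
    kept_terms = [word for word, _pos in tokens if not is_noise(word)]

    term_pairs = []
    for (w_j, _), (w_next, _) in zip(tokens, tokens[1:]):
        # Combine W_j and W_{j+1} only if both survived the filters.
        if not is_noise(w_j) and not is_noise(w_next):
            term_pairs.append(w_j + sep + w_next)
    return kept_terms, term_pairs


tokens = [("net", "N"), ("loss", "N"), ("of", "P"),
          ("the", "DET"), ("company", "N"), (".", "PUNCT")]
kept, pairs = preprocess(tokens, stop_words={"of", "the"}, punctuation={"."})
```

Pairs are formed on the original word order, so two words separated by a removed stop word are not joined.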
[Figure: fraudulent feature term filtering flowchart. Box labels: Non-Fraudulent Term Library; Term Matching; Term Selection (Information Gain); Fraudulent Feature Term Library.]

$$IG(C \mid E) = H(C) - H(C \mid E) = -\sum_{i} p(c_i)\log p(c_i) + \sum_{j} p(e_j)\sum_{i} p(c_i \mid e_j)\log p(c_i \mid e_j) \tag{2}$$

where IG(C|E) denotes the information gain of fraudulent/non-fraudulent term E in fraudulent/non-fraudulent correlated term class C, H(C) denotes the entropy of fraudulent/non-fraudulent correlated term class C, H(C|E) denotes the conditional entropy of fraudulent/non-fraudulent correlated term class C given fraudulent/non-fraudulent term E, p(c_i) denotes the probability of fraudulent/non-fraudulent correlated term class c_i, p(e_j) denotes the probability of fraudulent/non-fraudulent term e_j, and p(c_i|e_j) denotes the probability of fraudulent/non-fraudulent correlated term class c_i conditional on the occurrence of fraudulent/non-fraudulent term e_j.
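As an illustration of equation (2), the following sketch computes the information gain of a single term over a small labelled collection. It is a minimal example under stated assumptions: the term is treated as a binary feature (present or absent in a document), the logarithm is taken to base 2 (the base is not given in the text), and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(C) = -sum_i p(c_i) * log2 p(c_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, has_term):
    """IG(C|E) = H(C) - H(C|E) for a binary term-presence feature.

    labels:   class label of each document (e.g. "fraud" / "non-fraud").
    has_term: True/False per document, whether the term occurs in it.
    """
    total = len(labels)
    conditional = 0.0
    for value in set(has_term):
        subset = [c for c, e in zip(labels, has_term) if e == value]
        # p(e_j) * H(C | E = e_j)
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


labels = ["fraud", "fraud", "non-fraud", "non-fraud"]
perfect = information_gain(labels, [True, True, False, False])
useless = information_gain(labels, [True, False, True, False])
```

A term that occurs only in fraudulent documents removes all uncertainty about the class, while a term spread evenly over both classes contributes nothing, which is the basis for keeping the former as a fraudulent feature term.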
3.4 Narrative Annual Report Clustering
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad IDF_i = \log\frac{n}{df_i} \tag{1}$$

where TF_{i,j} is the frequency of term i appearing in a fraudulent/non-fraudulent document j, IDF_i is the inverse document frequency of term i across fraudulent/non-fraudulent documents, n_{i,j} is the number of occurrences of term i in fraudulent/non-fraudulent document j, \sum_k n_{k,j} is the total number of all terms appearing in fraudulent/non-fraudulent document j, n is the total number of fraudulent/non-fraudulent documents, and df_i is the number of fraudulent/non-fraudulent documents containing term i.
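The term weighting of equation (1) can be illustrated as follows. This is a sketch under assumptions, not the authors' code: documents are taken to be lists of already-filtered terms, the weight is the usual product TF × IDF, the natural logarithm is used because the base is not stated, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: TF * IDF} dictionary per document.

    documents: list of term lists, one list per annual report.
    """
    n = len(documents)
    # df[t] = number of documents that contain term t at least once.
    df = Counter(term for doc in documents for term in set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)  # sum_k n_{k,j}
        weights.append({term: (count / total) * math.log(n / df[term])
                        for term, count in counts.items()})
    return weights


docs = [["loss", "debt", "loss"], ["profit", "debt"]]
weights = tf_idf(docs)
```

A term that appears in every document, such as "debt" here, receives weight zero because log(n / df_i) = log 1 = 0.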
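$$Score_m = \frac{\sum_{i} TFIDF_{i,m}}{n_m} \tag{3}$$

where Score_m represents the weighted score of the fraudulent feature terms in the m-th article, n_m represents the total number of words in the m-th article, and TFIDF_{i,m} = TF_{i,m} \times IDF_i represents the TF-IDF weight of the i-th fraudulent feature term in the m-th article.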
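One reading of equation (3), assumed here, is that the TF-IDF weights of the fraudulent feature terms found in an article are summed and divided by the article's word count. The sketch below follows that reading; the function name `report_score` and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
import math


def report_score(tfidf_weights, feature_terms, total_words):
    """Sum the TF-IDF weights of the feature terms and divide by n_m.

    tfidf_weights: {term: TF-IDF weight} for one article.
    feature_terms: terms from the fraudulent feature term library.
    total_words:   n_m, the total number of words in the article.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    matched = sum(tfidf_weights.get(term, 0.0) for term in feature_terms)
    return matched / total_words


article_weights = {"loss": 0.4, "debt": 0.2, "profit": 0.1}
score = report_score(article_weights, {"loss", "debt"}, total_words=300)
```

Dividing by the word count keeps long reports from scoring higher merely because they contain more text.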
[Figure: QGA-SVM clustering flowchart. Box labels: Training Dataset; Testing Dataset; Sample Normalization; Radial Basis Functions Kernel K(x, x_i) Selection; Parameters c and g Selection; Random Selection of Chromosome from Queen Cohort q_i; Random Selection of Chromosome from Whole Population d_i; Crossover, D_{m+1} = q_i ⊗ d_i, where ⊗ is crossover; Mutation, D_{m+1} = M(D_{m+1}), where M is mutation; Calculation of Chromosome Fitness F(d_i); The Fitness > Threshold? (YES/NO); Optimal Solution (Optimal Parameters); Support Vector Machine Algorithm; Fraudulent Narrative Annual Reports; Non-Fraudulent Narrative Annual Reports.]

$$D_{m+1} = M(q_i \otimes d_i) \tag{4}$$

$$F(d_i) = \mathrm{rank}(D_{m+1}) \tag{5}$$

where F(d_i) denotes the fitness value, D_{m+1} denotes the primal objective function, q_i denotes the randomly selected fitness function in the optimal function sequence, and d_i denotes the randomly selected fitness function in all function sequences.

$$K(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right) \tag{6}$$

$$H(x) = \arg\max_{y} \sum_{t} \left(\log\frac{1}{\beta_t}\right) h_t(x, y) \tag{7}$$

where H(x) denotes the class index of QGA-SVM, h_t(x, y) denotes the class index of SVM, and \beta_t denotes the weight of SVM.

Finally, the testing data set is input into the QGA-SVM clustering model to determine the narrative annual report clustering results (i.e., fraud or non-fraud).
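The parameter search described above (a chromosome drawn from the queen cohort is crossed with one drawn from the whole population, the result is mutated, and the loop stops once the fitness exceeds a threshold) can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: the crossover is a random blend, the mutation is Gaussian noise, the queen cohort is the top 20% of the ranked population, the best chromosome is carried over between generations, and `toy_fitness` merely stands in for the fitness measure, which the text does not spell out. The names `queen_ga`, `rbf_kernel`, `queen_fraction`, and `mutation_scale` are hypothetical. The defaults for population size, number of evolutions, and threshold follow the values 20, 200, and 0.9 listed in the paper's parameter table.

```python
import math
import random


def rbf_kernel(x, x_i, sigma):
    """Radial basis function kernel: exp(-||x - x_i||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def queen_ga(fitness, bounds, population_size=20, generations=200,
             threshold=0.9, queen_fraction=0.2, mutation_scale=0.1, seed=0):
    """Search for a chromosome (here: [c, g]) whose fitness exceeds threshold."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        if fitness(best) > threshold:
            break
        # Rank the population; the top slice forms the queen cohort.
        ranked = sorted(population, key=fitness, reverse=True)
        queens = ranked[:max(1, int(queen_fraction * population_size))]

        offspring = []
        for _ in range(population_size - 1):
            q = rng.choice(queens)       # q_i: from the queen cohort
            d = rng.choice(population)   # d_i: from the whole population
            w = rng.random()
            # Crossover (assumed): random blend of the two parents.
            child = [w * a + (1.0 - w) * b for a, b in zip(q, d)]
            # Mutation (assumed): Gaussian noise, clipped to the bounds.
            child = [max(lo, min(hi, x + rng.gauss(0.0, mutation_scale * (hi - lo))))
                     for x, (lo, hi) in zip(child, bounds)]
            offspring.append(child)

        # Carry the best chromosome over unchanged (elitism, assumed).
        population = offspring + [best]
        best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(chromosome):
    """Stand-in fitness with a single peak at c = 5, g = 12 (illustrative only)."""
    c, g = chromosome
    return 1.0 / (1.0 + (c - 5.0) ** 2 + (g - 12.0) ** 2)


bounds = [(0.01, 100.0), (0.01, 100.0)]
best, score = queen_ga(toy_fitness, bounds, generations=30)
```

In an actual run the stand-in fitness would be replaced by a measure of how well an SVM trained with the candidate c and g performs, for example its validation accuracy; that choice is an assumption here.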
Sample             Fraudulent Reports to Shareholders   Non-Fraudulent Reports to Shareholders
Training Dataset   15                                   45
Testing Dataset    10                                   30
Parameter Name     Value Set
QGA Population     20
QGA Evolution      200
QGA Threshold      0.9
c and g of SVM     Based on the results of QGA
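                   Fraudulent Reports to Shareholders   Non-Fraudulent Reports to Shareholders
Testing Sample     10                                   30
[Remaining labels of this table: Total; Correctly Identified; Incorrectly Identified; P-Value; Detected at 0.01 level (Upper, Lower). Remaining cell values, in extracted order: 25; 1.53×10^-09; 6; 1.93×10^-66; 29. Their row and column positions are not recoverable.]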
Clustering Model   SVM Parameters         Elapsed Time   Accuracy
Decision Tree      ---        ---         --             73.3333%
Bayes              ---        ---         --             70.0000%
PNN                ---        ---         --             78.3333%
Grid-SVM           0.1088     27.8576     2.355880       80.0000%
PSO-SVM            0.1376     62.4124     7.950277       82.5000%
GA-SVM             0.2503     11.8989     2.712585       82.5000%
QGA-SVM            5.2058     12.6343     2.943193       85.0000%
5 CONCLUSIONS