
INTERNATIONAL JOURNAL FOR RESEARCH & DEVELOPMENT IN TECHNOLOGY | Volume-11, Issue-4 (Apr-19) | ISSN (O): 2349-3585

Improve Software Effort Estimation using Weighted KNN in New Feature Space
__________________________________________________________________________________________
Mohammad Javad Madari1, Hossein Bahrami2, Mehrnaz Niazi3
Dept. of Electrical and Computer Engineering, Pishtazan Institute of Higher Education, Shiraz, Iran

Abstract – With the growing demand for software and applications, presenting a stringent and reliable model for effort estimation is necessary. In this paper a feature-expansion method is proposed in order to increase accuracy: quadratic mapping is used as the feature expander. Quadratic mapping constructs more discriminative features and can therefore achieve better results. Although the mapping increases the dimensionality of the data, the results become more accurate, especially when W-KNN (weighted k-nearest neighbors) is used as the regression model. It should be mentioned that, given the shortage of data in this field, the dimensional increase does not cause a serious problem, and the processing can run on an ordinary home PC.

INTRODUCTION
There are broadly two approaches to data analysis: functional analysis and statistical analysis. The first approach is limited to a specific dataset and requires considerable intuition about that dataset, whereas the second is general, usable on any dataset, and simpler to implement; it is therefore the more common of the two. The Constructive Cost Model (COCOMO), one of the most famous functional approaches, was developed by Barry W. Boehm in the 1970s. It computes a software effort estimate from a predefined formula, but it was not accurate enough. COCOMO II, developed in 2000, improves on the earlier version and achieves a lower error rate [1].

We now review several recent papers. Correlation is a successful approach to selecting important features; a decision tree can also be used intelligently to select the features most related to the output, and an evolutionary SVM can be a valid method for predicting new values with low MMRE [2]. One of the most important features in every effort-estimation dataset is LOC (lines of code); UCP (Use Case Points) is a method for predicting LOC, after which the output can be predicted with the COCOMO II equation [1]. A cascade correlation neural network with cross-validation that exploits the use case diagram appears to be a viable alternative for software effort estimation [3]. Integrating a fuzzy technique that weights the features with UCP can yield reliable predictions; the fuzzy technique is used to calculate the UCP coefficients [4]. Modifying UCP by defining a Mobile Complexity Factor (MCF) coefficient gives better results for mobile applications; the MCF is obtained from the MTDI factor and from interviews with mobile developers [5]. Analogy-based effort estimation, which estimates effort from the k nearest analogies, can be a reliable method because of its capability to handle noisy datasets [6]. Combining two different datasets with similar features from one company significantly decreases the estimation error of statistical approaches [7]. A systematic mapping of ASEE (analogy-based software effort estimation) techniques reported considerable results, especially when the approach is combined with fuzzy logic or a genetic algorithm [8]. A comparison between a multilayer perceptron (MLP), a general regression neural network (GRNN), a cascade correlation neural network (CCNN), and a radial basis function neural network (RBFNN) showed that the CCNN outperformed the other models [9]. The Automatically Transformed Linear Model (ATLM) is a stringent baseline for comparison against software effort estimation methods and can be used as a baseline of effort prediction for every future model in effort estimation [10].

BASIC CONCEPTS
In this part, we first explain quadratic mapping, which is a kind of feature expander, and then review four statistical methods in detail.


FEATURE EXPANDER METHOD

QUADRATIC MAPPING
Mapping data to a higher-dimensional space can create more effective features for discriminating the data. Quadratic mapping provides a higher-order model which sometimes attains better separation [11]. For example, applying the quadratic mapping to data with two features $\{x_1, x_2\}$ produces the higher-dimensional space $\{x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\}$.
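For illustration, a minimal NumPy sketch of this mapping (our own example, not code from the paper's experiments) is:

```python
import numpy as np

def quadratic_map(X):
    """Map each row (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
print(quadratic_map(X))  # each 2-feature row becomes 3 quadratic features
```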

STATISTICAL REGRESSION METHODS

K NEAREST NEIGHBORS
1-nearest neighbor measures the distance between the test datum and every training datum and chooses the output of the nearest training datum as the output of the new test datum. The Euclidean distance is usually a dependable way to measure this distance [12]:

$$\mathrm{Dist}(x_j, x_i) = \lVert x_j - x_i \rVert \qquad (1)$$

Eq (1) shows how the Euclidean distance is calculated. In uniform KNN, the average of the outputs of the k nearest neighbors is taken as the output for the test datum:

$$y = \frac{1}{k}\,(y_1 + y_2 + \dots + y_k) \qquad (2)$$

Eq (2) shows uniform KNN, where $y$ is the predicted value and the $y_k$ are the outputs of the k nearest neighbors. Plain averaging is not the most accurate way to compute the output [13]. In this paper a Gaussian kernel has been used to weight each $y_k$:

$$y = \frac{k_1 y_1 + k_2 y_2 + \dots + k_n y_n}{\sum_{j=1}^{n} k_j} \qquad (3)$$

$$k_j = \exp\!\left(-(x_{\mathrm{test}} - x_{\mathrm{train},j})^2\right) \qquad (4)$$

The $k_j$ are the weights of the outputs of the k nearest neighbors. It should be mentioned that all features in this approach have been scaled between -1 and 1. Using the weights $k_j$ usually leads to better results [13].
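A compact sketch of the weighted-KNN predictor of Eqs (1)-(4) (illustrative code; features are assumed to be pre-scaled to [-1, 1], as stated above):

```python
import numpy as np

def wknn_predict(X_train, y_train, x_test, k=3):
    """Weighted KNN: Gaussian-kernel weights on the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Eq (1)
    nearest = np.argsort(dists)[:k]                    # k nearest neighbors
    w = np.exp(-dists[nearest] ** 2)                   # Eq (4)
    return np.sum(w * y_train[nearest]) / np.sum(w)    # Eq (3)

X_train = np.array([[0.1, 0.2], [0.4, -0.3], [-0.5, 0.9], [0.8, 0.1]])
y_train = np.array([10.0, 14.0, 22.0, 30.0])           # toy effort values
print(wknn_predict(X_train, y_train, np.array([0.2, 0.0]), k=3))
```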
RANDOM FOREST
A Random Forest is constructed from multiple decision trees and is a more effective predictor for both regression and classification [14]. An important difference is that a decision tree with a large number of nodes tends to overfit easily, whereas a Random Forest avoids overfitting by creating random subsets of the features and constructing smaller trees from these subsets [14]. Eqs (5) and (6) apply to the classifier version of Random Forest:

$$p(c \mid v_j(x)) = \frac{p(c, v_j(x))}{\sum_{l=1}^{n} p(c_l, v_j(x))} \qquad (5)$$

where $v_j(x)$ is the terminal node reached by $x$, for $j = 1, 2, \dots, t$.

$$g_c(x) = \frac{1}{t}\sum_{j=1}^{t} p(c \mid v_j(x)) \qquad (6)$$

The decision rule assigns $x$ to the class $c$ for which $g_c(x)$ is maximal [14]. For regression,

$$f(x) = \frac{1}{J}\sum_{j=1}^{J} h_j(x) \qquad (7)$$

where the $h_j(x)$, i.e. $h_1(x), h_2(x), \dots, h_J(x)$, are the outputs of the individual trees, and the average of their values is the output of the Random Forest regression model [15]. It should be mentioned that using Random Forest with cross-validation usually gives better results [16].
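A sketch of such a regression forest, here realized with scikit-learn's RandomForestRegressor on toy data (the estimator settings and the data are our illustrative assumptions, not the paper's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 16))              # toy stand-in for 16 features
y = 100 * np.abs(X[:, -1]) + rng.normal(0, 5, 60)  # toy effort values

# With max_features="sqrt", each split considers a random subset of the
# features, echoing the feature-subset idea above; the forest output is
# the average of the trees' outputs, as in Eq (7).
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_absolute_error")
print(scores.mean())  # evaluated with cross-validation, as suggested in [16]
```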
NEURAL NETWORK
An artificial neural network (ANN) is the result of a generalization of the human brain. The multilayer perceptron (MLP) is a special neural network structure comprising a large number of processing units, called neurons, arranged in layers. Each neuron in the MLP structure is a mathematical function connected to the neurons of the previous and next layers; the neurons of the first layer connect to the input data, i.e., the features [17].


Figure 1: A multilayer perceptron with two hidden layers.
The main problem in a neural network is updating the weights and biases of each neuron, which is usually done with the back-propagation algorithm.

$$S^M = -2\,\dot{F}^M(n^M)\,(t - a) \qquad (8)$$

where $S^M$ is the sensitivity of the last layer, $\dot{F}^M(n^M)$ is the derivative of the transfer function of the last layer, $t$ is the target, and $a$ is the predicted value.

$$S^m = \dot{F}^m(n^m)\,(W^{m+1})^T S^{m+1}, \qquad m = M-1, \dots, 2, 1 \qquad (9)$$

where $S^m$ is the sensitivity of layer $m$.
$$W^m(k+1) = W^m(k) - \alpha\, S^m (a^{m-1})^T \qquad (10)$$

$$b^m(k+1) = b^m(k) - \alpha\, S^m \qquad (11)$$

The weights and biases are updated with Eqs (10) and (11) [18], where $\alpha$ is the learning rate and $a^{m-1}$ is the output of layer $m-1$.
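A minimal NumPy sketch of one back-propagation step per Eqs (8)-(11), assuming a single tanh hidden layer and a linear output layer (the paper does not state its transfer functions):

```python
import numpy as np

def backprop_step(W1, b1, W2, b2, x, t, alpha=0.01):
    """One update of Eqs (8)-(11) for a one-hidden-layer MLP.
    Assumed (not stated in the paper): tanh hidden layer, linear output."""
    # Forward pass
    a1 = np.tanh(W1 @ x + b1)           # hidden-layer output
    a2 = W2 @ a1 + b2                   # network output a
    # Sensitivities
    s2 = -2 * (t - a2)                  # Eq (8); F' of a linear layer is 1
    s1 = (1 - a1 ** 2) * (W2.T @ s2)    # Eq (9); tanh' = 1 - tanh^2
    # Weight and bias updates with learning rate alpha
    W2 -= alpha * np.outer(s2, a1); b2 -= alpha * s2   # Eqs (10), (11)
    W1 -= alpha * np.outer(s1, x);  b1 -= alpha * s1
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
W1, b1, W2, b2 = backprop_step(W1, b1, W2, b2,
                               x=np.array([0.1, -0.2, 0.3]), t=np.array([1.0]))
```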
KMEANS AND HAMMING DISTANCE
k-means clustering is a simple way to divide data into distinct parts; the division is based on the mean of each part [19]. Each new input is allocated to a cluster, and the mean of that cluster is then updated. The algorithm continues until the mean of each cluster does not change over one iteration [19]. In this paper, after the clusters are determined, the criterion for measuring the distance between a test datum and the members of its nearest cluster is the Hamming distance:

$$\mathrm{HammingDistance}(x, y) = \frac{\text{number of unequal columns of } x \text{ and } y}{\text{number of total columns of } x \text{ or } y} \qquad (12)$$

Eq (12) shows how the Hamming distance is measured [20], where $x$ is the test datum and $y$ is a member of the training cluster nearest to $x$. Before prediction with the Hamming distance begins, all data (train and test) are labeled with the k-means approach described above; for prediction, the Hamming distance is then measured between the test datum and the part of the training data that carries the same label.

Figure 2: Distances between a test datum and the part of the training data with the same label. Green circles are the means of the clusters, blue circles are training data, and the yellow circle is the test datum.
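The following sketch outlines this two-stage predictor; since the paper does not fully specify how the final estimate is formed, it assumes, for illustration, that the output of the training member with the smallest Hamming distance is returned:

```python
import numpy as np
from sklearn.cluster import KMeans

def hamming(x, y):
    """Eq (12): fraction of unequal columns between two rows."""
    return np.mean(x != y)

def kmeans_hamming_predict(X_train, y_train, x_test, n_clusters=3):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    label = km.predict(x_test.reshape(1, -1))[0]      # cluster of the test datum
    idx = np.where(km.labels_ == label)[0]            # train rows with same label
    d = [hamming(x_test, X_train[i]) for i in idx]    # Eq (12) to every member
    # Assumption: the estimate is the output of the closest member.
    return y_train[idx[int(np.argmin(d))]]

rng = np.random.default_rng(1)
X_train = rng.integers(0, 3, size=(30, 5)).astype(float)  # toy categorical data
y_train = X_train.sum(axis=1) * 10                        # toy effort values
print(kmeans_hamming_predict(X_train, y_train, X_train[0]))
```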


PROPOSED METHOD
In this paper quadratic mapping has been utilized to build a more accurate model. However, the NASA and COCOMO datasets contain fewer than 70 samples with 16 input features, and a full quadratic mapping would cause a kind of redundancy because of the 136 columns it produces (120 pairwise products plus 16 squared terms). Therefore, only the input features with an absolute correlation of at least 0.2 with ACT_EFFORT have been selected for the multiplication part of the quadratic mapping, the $\sqrt{2}\,x_i x_j$ terms. These 9 features are highlighted in Table 2, and pairwise multiplication of them yields 36 input features. It should be mentioned that all 16 input features have been used for the other part of the quadratic mapping, the squared terms $x_i^2$. Therefore, after mapping, the number of columns is 36 + 16 = 52.
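A sketch of this expansion step (illustrative code; the correlation threshold and the product form follow the text above, while the toy data are our assumption):

```python
import numpy as np
from itertools import combinations

def expand_features(X, y, corr_threshold=0.2):
    """Sketch of the proposed expansion: x_i^2 for all features plus
    sqrt(2)*x_i*x_j for pairs of features whose absolute Pearson
    correlation with the output reaches `corr_threshold`."""
    corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    selected = np.where(np.abs(corr) >= corr_threshold)[0]
    squares = X ** 2                                   # all 16 squared terms
    products = np.column_stack([np.sqrt(2) * X[:, i] * X[:, j]
                                for i, j in combinations(selected, 2)])
    return np.hstack([squares, products])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (60, 16))                       # toy stand-in data
y = 3 * X[:, 15] + X[:, 3] + rng.normal(0, 0.3, 60)    # toy effort values
print(expand_features(X, y).shape)   # 16 + C(k, 2) columns; 52 when k = 9
```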
Table 1: Features of the NASA and COCOMO datasets.

#    Feature          Description
1    ACAP             analyst's capability
2    PCAP             programmer's capability
3    AEXP             application experience
4    MODP             modern programming practices
5    TOOL             use of software tools
6    VEXP             virtual machine experience
7    LEXP             language experience
8    SCED             schedule constraint
9    STOR             main memory constraint
10   DATA             database size
11   TIME             time constraint for CPU
12   TURN             turnaround time
13   VIRT             machine volatility
14   CPLX             process complexity
15   RELY             required software reliability
16   LOC              lines of code
17   ACTUAL EFFORT    actual effort (output)

Figure 3: Correlation results for all features.

Table 2: Correlation of the 16 input features with ACT_EFFORT. Features marked * are the 9 selected for the multiplication part of the quadratic mapping.

#    Feature    Correlation
1    ACAP       -0.1
2    PCAP        0.2 *
3    AEXP       -0.2 *
4    MODP        0.4 *
5    TOOL        0.3 *
6    VEXP        0.2 *
7    LEXP        0
8    SCED       -0.1
9    STOR        0.2 *
10   DATA        0
11   TIME        0
12   TURN        0
13   VIRT        0
14   CPLX        0.2 *
15   RELY        0.3 *
16   LOC         0.9 *

In this paper the data have been separated into two parts in every training run: 80% of the samples are assigned to the training data and the remaining 20% to the test data. It should be mentioned that all four regression methods introduced in Basic Concepts have been run five times, so that each sample is used once as test data and four times as training data; finally, the average of the five results is calculated.
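This protocol amounts to 5-fold cross-validation; a sketch with toy data and a trivial stand-in predictor in place of the four regression models:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (60, 16))                # toy stand-in data
y = rng.uniform(100, 1000, 60)                  # toy effort values

# Five 80/20 splits so that each sample is used once as test data and
# four times as train data; the five fold results are then averaged.
errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    # A real run would fit one of the four regression models here;
    # we use the training mean as a trivial stand-in predictor.
    pred = np.full(len(test_idx), y[train_idx].mean())
    errors.append(np.mean(np.abs((pred - y[test_idx]) / y[test_idx])))
print(100 * np.mean(errors))                    # averaged over the 5 folds
```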
EXPERIMENTAL RESULTS
MMRE and PRED30 are the two usual error measures for determining the accuracy of effort estimation models:

$$RE_i = \frac{\mathrm{estimate}_i - \mathrm{actual}_i}{\mathrm{actual}_i} \qquad (13)$$

$$MRE_i = \lvert RE_i \rvert \qquad (14)$$

$$MMRE = \frac{100}{T}\,(MRE_1 + MRE_2 + \dots + MRE_T) \qquad (15)$$


where $T$ is the length of the data. Eqs (13)-(15) show how MMRE is calculated.

$$PRED30 = \frac{100}{T} \times \mathrm{count} \qquad (16)$$

To calculate PRED30, count the samples whose $MRE_i$ is less than or equal to 0.3 and substitute this count into Eq (16).
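Both measures are straightforward to compute; a minimal sketch:

```python
import numpy as np

def mmre(estimates, actuals):
    """Eqs (13)-(15): mean magnitude of relative error, in percent."""
    mre = np.abs((estimates - actuals) / actuals)      # Eqs (13), (14)
    return 100 * mre.mean()                            # Eq (15)

def pred30(estimates, actuals):
    """Eq (16): percentage of samples with MRE <= 0.3."""
    mre = np.abs((estimates - actuals) / actuals)
    return 100 * np.mean(mre <= 0.3)

est = np.array([120.0, 80.0, 300.0])                   # toy estimates
act = np.array([100.0, 90.0, 500.0])                   # toy actual efforts
print(mmre(est, act), pred30(est, act))
```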
Table 3: MMRE and PRED30 results on the NASA dataset. The first pair of columns shows MMRE and PRED30 without any mapping (the 16 input features of Table 2); the second pair shows them with the quadratic mapping (proposed method).

                          16 input features      Proposed method
Methods                   MMRE     PRED30        MMRE     PRED30
W-KNN                     85.26    18.14         36.96    27.8
RF                        54.80    34.41         58.41    38.02
MLP                       98.21    21.82         74.38    17.8
K-means + Hamming dist.   75.32    19.17         57.85    31.1
Table 4: MMRE and PRED30 results on the COCOMO dataset. The first pair of columns shows MMRE and PRED30 without any mapping; the second pair shows them with the quadratic mapping (proposed method).

                          16 input features      Proposed method
Methods                   MMRE     PRED30        MMRE     PRED30
W-KNN                     103.67   19.18         84.17    21.67
RF                        173.54   26.27         123.27   26.45
MLP                       171.14   8.82          131.86   16.67
K-means + Hamming dist.   124.02   17.43         133.41   22.05
CONCLUSION
In this paper quadratic mapping has been used as a feature expander, and four regression models, namely W-KNN, Random Forest, multilayer perceptron, and k-means with Hamming distance, have been evaluated. Although quadratic mapping maps the data to a higher-dimensional space, it achieves better results in most cases, as shown in the experimental results. The best MMRE belongs to W-KNN and the best PRED30 belongs to Random Forest.

REFERENCES
[1] C. Nagar and A. Dixit, “Efforts Estimation by combining the Use Case Point and COCOMO,” Int. J. Comput. Appl., vol. 52, no. 7, pp. 1–5, 2012.
[2] T. Mahboob, S. Gull, S. Ehsan, and B. Sikandar, “Predictive Approach towards Software Effort Estimation using Evolutionary Support Vector Machine,” Int. J. Adv. Comput. Sci. Appl., vol. 8, no. 5, pp. 446–454, 2017.
[3] A. B. Nassif, L. F. Capretz, and R. Hill, “Software Effort Estimation in the Early Stages of the Software Life Cycle Using a Cascade Correlation Neural Network Model,” ACIS Int. Conf. Softw. Eng. Artif. Intell. Netw. Parallel/Distributed Comput., pp. 589–594, 2012.
[4] M. Grover, P. K. Bhatia, and H. Mittal, “Estimating Software Test Effort Based on Revised UCP Model Using Fuzzy Technique,” Inf. Commun. Technol. Intell. Syst., Mar. 2018.
[5] A. Mahi and K. Kaur, “Effort Estimation for Mobile Applications using Use Case Point (UCP),” Jun. 2017.
[6] A. B. Nassif and M. Azzeh, “Analogy-based effort estimation: a new method to discover set of analogies from dataset characteristics,” IET Softw., vol. 9, no. 2, pp. 39–50, 2015.
[7] E. Kocaguneli, T. Menzies, and E. Mendes, “Transfer learning in effort estimation,” Empir. Softw. Eng., vol. 20, no. 3, pp. 813–843, 2015.
[8] A. Idri and A. Abran, “Analogy-based software development effort estimation: A systematic mapping and review,” Inf. Softw. Technol., 2014.
[9] A. B. Nassif, M. Azzeh, L. F. Capretz, and D. Ho, “Neural network models for software development effort estimation: a comparative study,” Neural Comput. Appl., vol. 27, no. 8, pp. 2369–2381, 2015.
[10] P. A. Whigham, C. A. Owen, and S. G. MacDonell, “A Baseline Model for Software Effort Estimation,” vol. 24, no. 3, 2015.
[11] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms. 2001.
[12] K. Yu, L. Ji, and X. Zhang, “Kernel Nearest-Neighbor Algorithm,” pp. 147–156, 2002.
[13] E. Fox and C. Guestrin, “Going nonparametric: Nearest


neighbor and kernel regression,” in Machine Learning Specialization, 2015, pp. 1–61.
[14] T. K. Ho, “Random Decision Forests,” Proc. 3rd Int. Conf. Doc. Anal. Recognit., pp. 278–282, 1995.
[15] A. Cutler, D. R. Cutler, and J. R. Stevens, “Random Forests,” in Ensemble Machine Learning, Springer, 2012.
[16] D. R. Cutler et al., “Random forests for classification in ecology,” Ecology, vol. 88, no. 11, pp. 2783–2792, Nov. 2007.
[17] P. Rijwani and S. Jain, “Enhanced Software Effort Estimation Using Multi Layered Feed Forward Artificial Neural Network Technique,” Procedia Comput. Sci., vol. 89, pp. 307–312, 2016.
[18] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural Network Design, pp. 125–127.
[19] J. Wu, H. Liu, H. Xiong, and J. Cao, “A theoretic framework of K-means-based Consensus Clustering,” IJCAI Int. Jt. Conf. Artif. Intell., pp. 1799–1805, 2013.
[20] M. Tang, Y. Yu, W. G. Aref, Q. M. Malluhi, and M. Ouzzani, “Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce,” EDBT, pp. 361–372, 2015.

