
Smartness in Code Review

Project Report
UW CS 846 Winter 2018

Ananya, Kalhan Dhar, Naga Malleswara Rao Y
University of Waterloo
200 University Avenue, Waterloo, ON, Canada
aananya@uwaterloo.ca, kdhar@uwaterloo.ca, nmyagant@uwaterloo.ca

© 2018 ACM. ISBN 123-4567-24-567/08/06 ...$15.00. DOI: 10.475/123_4

ABSTRACT

Code review is an important tool for maintaining the quality and design of a code base in the field of software development. It is intended to find coding mistakes that may be overlooked during the development phase and therefore reduce the risk of bugs in the final deliverable or when the software is deployed in production. However, there are potential hidden costs related to code reviews as well. These include the time spent by the reviewers, the time spent by authors addressing the feedback, and the time a commit spends waiting in the review queue. Developer time is very valuable to software development companies, and if a big chunk of it is spent reviewing code, there is less time available to build actual software. In this project, we first perform an exploratory data and statistical analysis of code and developer behaviour, mine the code review repository to identify the types of errors developers usually make, categorize them into the common error themes that occur in code reviews, and potentially create a system that can identify some of those mistakes before a reviewer reviews the code. Such a prediction system can make the code review process faster through automatic detection of code review errors, based on file violations and the comments previously received on files with such errors. We also propose a recommendation system that can recommend peer reviewers for a given piece of code; the system can suggest peers who are better suited to review the code. Finally, we present an algorithm that can potentially improve the accuracy of such prediction and recommendation systems.

Keywords

Code review, Prediction Engine, Reviewer Recommendation, Statistical Analysis

1. INTRODUCTION

Code reviews are one of the important software development activities in the software industry. They work as a quality safeguard for code patches that are to be integrated into the master code base of a software system. They help to detect simple source code defects like logical errors and coding standard violations, e.g., lack of descriptive identifier names, in the early phases of development, which reduces the overall cost of fixing a bug later on.

Nowadays, many Open Source Software (OSS) projects use a modern code review system like Gerrit (https://www.gerritcodereview.com/), GitLab (https://gitlab.com) or ReviewBoard (https://www.reviewboard.org/) to archive the records of code review activities in their repositories. Modern code review (MCR) has therefore become a vital and essential part of OSS.

Despite the benefits of code reviews, they are a great bottleneck in the development process. Reviews can take a long time to get approved, thereby increasing the overall software development time and thus affecting a developer's productivity. Especially for OSS projects, where developers work from different locations and time zones, code reviews sometimes have significant delays in acquiring the first few comments and feedback. Analyzing how reviewers react to a code review (CR), improving the CR based on that analysis beforehand, and kick-starting the actual review process afterwards can shorten the duration of the review process. Therefore, a system that predicts what comments reviewers might make on a CR and points out common error themes in the code under review can help developers improve their code at a very early stage, so that they have a higher chance of getting a successful ship-it with the least number of revisions. This will not only reduce costs in the software development process and enable developers to accomplish the same amount of work in less time, but can also lead to increased job productivity and satisfaction. The same tool can also provide decision-making support for developers to identify the most appropriate peer reviewers for their code changes, based on their expertise and collaboration in past reviews.

1.1 Research Questions

In this project, as part of building a system for helping developers accelerate the code review process, we tried to answer the following research questions using exploratory data and statistical analysis along with machine learning:

• Are there any patterns in the code review process? What are the insights that can be derived from the existing process by performing a statistical analysis on the code review repository?
• What are the common comments that are received in the code review? Can they be classified to understand where the team can improve? Can we identify common themes in the code based on the comments? Can these error themes be identified before the code review process begins?

• Can we identify ideal peer reviewers for a given code review based on their past review work?

Our ultimate goal is to potentially build a system that can predict the common errors in a code review before a reviewer reviews it and then suggest the ideal reviewers for analyzing that patch.

2. DATASET

To understand how developers review code, we need a real data-set from standard software development teams. Different teams use different internal and external tools to review the code and provide feedback to their peers. For our work, we used a mined data-set which was provided as part of MSR 2016 [14]. This data-set brings together all of the code review data from five open source projects under the same representation. The data-set we used is a subset of the original data-set which contains code review information from Gerrit, a standard open source code review tool where developers can comment on and give a score to the code being reviewed. We used the code review information from GerritHub, which was organized and hosted in a MySQL database on Microsoft Azure. A summary of the data-set can be seen below; it has information about the patches, comments and reviewers.

2.1 Dataset Schema

The data-set has five tables. The main entities of the schema are as follows:

• Change - The change table represents an instance of a code change that is in the review system. The table also contains relevant information such as the author of the code change.

• Revision - As a change gets reviewed, it may undergo several revisions of the source code before it is committed. The revision holds information such as the final commit date of the code change.

• People - The people table was created to store all details of the review members. Each member can be identified with a unique id.

• History - The history table contains all messages or comments related to a review. The history table contains the messages attribute that can be used to identify all comments and activities related to the review process.

• File - The file table contains the details of the code changes. This table contains information such as the path name and size of the code change.

These tables are rich sources of information for answering the first research question. However, in order to identify error themes in the code and recommend reviewers for new code, the actual code files and their respective human comments are required for further analysis. We observed that this data was not entirely present in the data-sets that we initially fetched from the Gerrit APIs, for all the change ids and revision ids that were present in the code review data-set. Also, since the data was enormous, we limited our analysis to Java source code files only. We used the GetContent API to fetch the contents of a file using its revision and change id. During the analysis we found that a large number of the comments (build started, build finished, URLs to build reports, etc.) were generated by Gerrit bots. These were not at all useful in the prediction of the error themes, as we wanted to specifically target comments by reviewers, and therefore we removed all the system-generated comments and bot comments. The List Change Comments API by Gerrit was used to fetch only the comments that were made by actual code peers and not the bots.

Below is an example of the bot comments that we had to filter out while creating the smart prediction engine. These comments have little to no insight into the actual errors in the code files being reviewed.

Patch Set 2:
Build Started http://10.24.20.18:30081/job
/images/job/att-comdev/job/armada/job/
armada/556/ (1/2)
Build Successful
http://12.37.173.196:32775/repository
/att-comdev-jenkins-logs/Codereview/att-comdev
/armada/405720/att-comdev/
armada-405720-6106.log : SUCCESS
12.37.173.196:32775/repository/
att-comdev-jenkins-logs/
att-comdev/armada/575/armada-575
:SUCCESS
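To make the filtering step concrete, a minimal sketch of how such bot and CI messages could be stripped from the comment stream is shown below. The regular expressions, the bot-name heuristic and the ReviewComment record are illustrative assumptions of this sketch, not the exact filter used in this project.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative comment record; the real data came from the Gerrit
// "List Change Comments" API and the MySQL dump described above.
record ReviewComment(String author, String message) {}

public class BotCommentFilter {

    // Assumed patterns for typical CI/bot noise (build status lines, job URLs).
    private static final List<Pattern> BOT_PATTERNS = List.of(
            Pattern.compile("Build (Started|Successful|Failed)", Pattern.CASE_INSENSITIVE),
            Pattern.compile("https?://\\S+/job/\\S+"),
            Pattern.compile("jenkins", Pattern.CASE_INSENSITIVE));

    /** Keeps only comments that look like they were written by human reviewers. */
    public static List<ReviewComment> humanComments(List<ReviewComment> all) {
        return all.stream()
                .filter(c -> !c.author().toLowerCase().contains("bot"))
                .filter(c -> BOT_PATTERNS.stream()
                        .noneMatch(p -> p.matcher(c.message()).find()))
                .collect(Collectors.toList());
    }
}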
3. METHODOLOGY

3.1 Exploratory Data and Statistical Analysis

3.1.1 Identifying the average number of days to push code

Analysis of the average number of days a team takes to push a change to the VCS can help the team better plan resources. We analyzed 92 projects from GerritHub which have at least 100 changes, to identify how teams work with changes. We took projects with at least 100 changes because we assumed that by the time a team hits 100 changes, the team of developers would have established their processes.

If there were no bottlenecks, then almost all of the projects should have fallen into the buckets with fewer days. The data loosely portrays the fact that this does not happen. Changes take a lot of time to be pushed. There can be many reasons that lead to this, for example the size of the change, reviewer availability and certain code review bottlenecks. We are interested in exploring solutions for providing early feedback.

Figure 1: Average number of days to push code

3.1.2 Understanding the time to first code response

Our hypothesis was that code authors have to wait a long time for their peers to respond after the code is sent for review (response refers to a peer's comment or a review score). For this analysis, we took the same set of projects again. The results can be seen in Figure 2.

Figure 2: Average time to first response
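As an illustration of how this metric can be derived from the mined tables, the sketch below computes the average time to first human response per project. The ChangeRow and HistoryRow records are hypothetical simplifications of the actual MySQL schema, not the project's real data-access code.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified views of the Change and History tables.
record ChangeRow(String project, String changeId, Instant createdOn) {}
record HistoryRow(String changeId, Instant postedOn, boolean fromBot) {}

public class FirstResponseStats {

    /** Average days from change creation to the first human response, per project. */
    public static Map<String, Double> averageDaysToFirstResponse(
            List<ChangeRow> changes, List<HistoryRow> history) {
        Map<String, List<Double>> perProject = new HashMap<>();
        for (ChangeRow c : changes) {
            // Earliest non-bot message for this change; changes with no response are skipped.
            history.stream()
                    .filter(h -> h.changeId().equals(c.changeId()) && !h.fromBot())
                    .map(HistoryRow::postedOn)
                    .min(Comparator.naturalOrder())
                    .ifPresent(first -> perProject
                            .computeIfAbsent(c.project(), p -> new ArrayList<>())
                            .add(Duration.between(c.createdOn(), first).toHours() / 24.0));
        }
        Map<String, Double> averages = new HashMap<>();
        perProject.forEach((project, days) -> averages.put(project,
                days.stream().mapToDouble(Double::doubleValue).average().orElse(0.0)));
        return averages;
    }
}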
The results indicate that there is definitely a choke point. In more than 50% of the projects, the response time is more than a week. This means that for a week, the author has no clue or direction as to how they can improve the code. This causes a huge waste of software developer productivity and of overall team development time.

3.1.3 Understanding Code Contribution Activity

An interesting fact for project maintainers to know is how people create code patches on each day of the week. We analyzed the code review data and found that most contributions are made on Thursdays, while weekends see the least number of contributions. The results can be seen in Figure 3. Tuesdays, Wednesdays and Thursdays saw the most revisions and form the peak period. This could mean that if most of the revisions are published on these days, they do not get early or enough comments to be posted again for review the following day. Such revisions may have to wait for this peak period in the following week.

Figure 3: Contributions by day of the week

3.1.4 Finding the average number of people contributing to a change

Gerrit allows multiple people to collaborate on a commit. Figure 4 shows the average number of people contributing to a single commit. The average has always been below 2 in all the projects. The average number of people contributing to a change is low. This means few people have context on the code, and it might lessen their likelihood of commenting on the code. Productivity might increase if maintainers can encourage more collaboration; maintainers can experiment with this workflow. Collaboration within commits will create working versions of products, and the master branch will always have working code.

Figure 4: Average number of contributors per change

3.1.5 Distribution of Review scores

We wanted to see how reviewers use the Gerrit features to subtly appreciate or protest a change. An analysis of 20 projects is presented in Figure 5. A closer look at the trend indicates that people are not confidently endorsing changes to be pushed into production. Overall there is no clear trend in how developers are using +1 and -1 across projects.

Figure 5: Average number of review votes per project

3.2 Identification & Prediction of Error themes

3.2.1 Analysis and classification of the comments

We performed an analysis of the comments given by peer reviewers on Java files. The first step was to perform entity-based text extraction. For this, we fetched all the comments into our database and preprocessed the words. We removed all the stop words from the comments and created a word cloud to see the words with high frequency (Figure 6).

Figure 6: Wordcloud visualization of most common words in comments

With the visualization of the common words and the research done by Czerwonka et al. [2], we were able to identify six broad error themes:

• Documentation - Errors regarding comments, naming and style of the code.
• Organization - Issues regarding modularity, location of artifacts, new classes, duplicate code and size of methods.

• Validation - Missing or improper validation of the code.

• Visual Representation - Issues regarding the visual representation, like beautification, blank lines, indentation, etc.

• Defect - Errors regarding incorrect implementation or missing functionality.

• Logical - Issues in the control flow and logic.

If an error cannot be classified into any of the above themes, then we categorized it as an UNKNOWN error, which becomes the seventh error theme for our error prediction engine.

With the above analysis we were able to make links between which words can be used to identify an error theme. For instance, comments which contain keywords like methods, imports and class have a higher likelihood of being related to Organization and Defect errors, whereas comments consisting of words like white space and blank lines have a higher likelihood of being related to Visual Representation errors.

We performed a unigram, bigram and trigram analysis over the comments data using an N-gram analyzer (http://guidetodatamining.com/ngramAnalyzer/index.php). The findings of this analysis helped us create a rule-based classifier for error theme segmentation. The rules were created by adding the keywords relevant to the identification of an error theme to that particular categorization set. Table 1 shows some of the important keywords that were added to each error theme set based on the unigram and bigram analysis. Once the rule set was created, we were able to link the inference derived from the comments and classify them into error themes.
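The unigram and bigram counting behind this keyword discovery is straightforward. A minimal Java sketch is shown below; the study itself used the external N-gram analyzer linked above, so this is only an equivalent illustration.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramCounter {

    /** Counts n-grams (n = 1 for unigrams, n = 2 for bigrams) in one comment. */
    public static Map<String, Integer> countNgrams(String comment, int n) {
        if (comment == null || comment.isBlank()) {
            return Map.of();
        }
        // Lowercase, drop punctuation, split on whitespace.
        String[] tokens = comment.toLowerCase()
                .replaceAll("[^a-z0-9 ]", " ")
                .trim()
                .split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            List<String> gram = new ArrayList<>();
            for (int j = i; j < i + n; j++) {
                gram.add(tokens[j]);
            }
            counts.merge(String.join(" ", gram), 1, Integer::sum);
        }
        return counts;
    }
}

Aggregating these counts over all reviewer comments surfaces the high-frequency terms from which the keyword sets in Table 1 were hand-picked.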
Error Themes          | Keywords (Unigrams and Bigrams)
Documentation         | fix wiki, no comment, missing comment, missing fields, javadoc, todo, annotation, copyright, typo, rename
Organization          | unused import, new class, fix patch, duplicate code, not needed, file structure, conflict, pattern, rearrange
Validation            | integration test, unit test, tests fail, test code, add test, error message, debug, invalid, valid
Visual Representation | white lines, white space, empty line, blank line, missing space, padding, camel, tabs, indentation
Defect                | seems wrong, is incorrect, didn't work, wrong because, unfinished, defect, violation, broken
Logical               | not logical, control flow, function parameter, infinite, throws, deserialization, constructor

Table 1: Important keywords for each error theme

Out of the 2275 comments that were fetched for 936 Java files, we were able to categorize the comments present for each Java file into a single error theme by calculating the occurrences of each of the keywords and assigning the error theme for which the maximum number of keywords were present. After doing this, we were able to classify 720 of the 936 files into one of the six error themes, as seen in Table 2. The remaining 216 files were assigned the UNKNOWN error status, since our rule-based algorithm was unable to classify them into any single category.

We encountered certain scenarios where a single file was labelled with multiple error themes, i.e., there was a tie in classifying a file between two error themes. In such cases, we took the severity of the error theme into consideration. We assigned a severity score, as seen in Table 2, from 0 to 5 based on which type of error theme should be given priority. We gave logical errors the highest priority and visual representation errors the lowest, after understanding which were the most critical errors, as mentioned in [13]. This means that if a file has an equal number of keyword occurrences from both the visual representation error set and the logical error set, then the file would be categorized into the logical error theme and not the visual representation theme.

Error themes          | Count of files | Severity
Documentation         | 66             | 1
Organization          | 218            | 3
Validation            | 31             | 2
Visual Representation | 71             | 0
Defect                | 81             | 4
Logical               | 253            | 5
UNKNOWN               | 216            | -1

Table 2: Number of files classified and the severity of each of the error themes
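To make the labelling rule concrete, the following sketch counts keyword hits per theme over a file's comments and breaks ties using the severity ranking of Table 2. The keyword sets are abbreviated and the class is illustrative, not the project's actual implementation.

import java.util.List;
import java.util.Map;

public class ThemeClassifier {

    // Abbreviated keyword sets from Table 1 and severities from Table 2.
    private static final Map<String, List<String>> KEYWORDS = Map.of(
            "Documentation", List.of("javadoc", "todo", "typo", "rename"),
            "Organization", List.of("unused import", "duplicate code", "not needed"),
            "Validation", List.of("unit test", "add test", "invalid"),
            "Visual Representation", List.of("white space", "blank line", "indentation"),
            "Defect", List.of("is incorrect", "wrong because", "broken"),
            "Logical", List.of("control flow", "infinite", "constructor"));

    private static final Map<String, Integer> SEVERITY = Map.of(
            "Documentation", 1, "Organization", 3, "Validation", 2,
            "Visual Representation", 0, "Defect", 4, "Logical", 5);

    /** Assigns a single error theme to the comments of one file, or UNKNOWN. */
    public static String classify(List<String> commentsForFile) {
        String text = String.join(" ", commentsForFile).toLowerCase();
        String best = "UNKNOWN";
        long bestHits = 0;
        for (Map.Entry<String, List<String>> theme : KEYWORDS.entrySet()) {
            long hits = theme.getValue().stream().filter(text::contains).count();
            if (hits == 0) {
                continue;
            }
            // More keyword hits wins; ties go to the more severe theme (Table 2).
            if (hits > bestHits
                    || (hits == bestHits
                        && SEVERITY.get(theme.getKey()) > SEVERITY.getOrDefault(best, -1))) {
                best = theme.getKey();
                bestHits = hits;
            }
        }
        return best;
    }
}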
3.2.2 Extraction of features from the code files

In order to create a classifier that can predict error themes on the code review, we segmented Java code files into a set of features. There are several approaches to tackle this. One approach is to parse the text into individual tokens and feed them to a classifier. These individual tokens could be the unique identifiers of the code. Such features include the number of variables, loops, the code itself, etc. We experimented with these features; however, we found that this is not the ideal way to proceed.

Code A

int lowerbound = 1;
int upperbound = x;
int root = lowerbound + (upperbound - lowerbound)/2;
while(root > x/root || root+1 <= x/(root+1)){
    if(root > x/root){
        upperbound = root;
    } else {
        lowerbound = root;
    }
    root = lowerbound + (upperbound - lowerbound)/2;
}

Code B

double squareRoot = number/2;
do {
    g1 = squareRoot;
    squareRoot = (g1 + (number/g1))/2;
} while((g1 - squareRoot) != 0);

For instance, if we take a look at the two Java code snippets above, both are logically equivalent. They calculate the square root of a number; however, they do so in two different ways. For example, there is a high chance of an integer overflow error in the first case, whereas the second can run into an infinite loop condition. Both cases can have Logical errors. However, if we use the number of unique variables as a feature, these two similar pieces of code can fool the classifiers and report violations which are specific to this case.

The number of lines of code is another deceiving feature. A small piece of code can fulfil a feature better, or it can do worse. Continuing our square root example, consider Code C:

Math.sqrt(x)

Another example is Code D:

Math.sqrt((int) x)

Although they look almost the same, valuable floating point information is lost when the value is passed to the square root function in Code D. An error marked with such a feature set can only predict a similar error if the same code is submitted again without many changes. Although these kinds of features are useful to a certain extent, such shallow features alone are not enough for predicting an error theme from a piece of code. Therefore, for our experiment, we needed features that report possible violations in a code file. These violations are a better source of information for classifiers.

We looked into a static code analysis tool for Java, Checkstyle (http://checkstyle.sourceforge.net/). It is a tool that helps Java programmers identify potential issues with their code without executing it, and it also identifies code style violations. These coding standards and code behaviours can potentially help identify different kinds of errors in the code files submitted for review. We forked the Checkstyle package and modified the underlying work-flow; class hierarchies were altered to suit our needs. This error information was fed as extra input to our centralized data collector singleton service
which is connected to our database. Checkstyle is designed in such a way that when a file has an error, it reports that type of error and logs a detailed definition of the violation. We used this mechanism to create two families of features.

• Violation Frequency: There were about 160 violations in Checkstyle. Instead of logging the violation for each Java file fed to Checkstyle, we kept track of the events being triggered whenever the class object of a certain violation was invoked. An example of the types of violations considered is shown in Table 3. We created a singleton class that keeps track of such code violations (a simplified sketch of such a collector is shown after Table 3). Once all the violations were captured, we connected it to our database by creating our own database communication service and fed these violations into our feature table hosted in a central resource. Our unique identifier was a composite key of change id, patch set number and file name. Therefore, all the features for a file of a particular revision, change and patch were pushed to this feature repository.

• Absolute Features from Checkstyle: Machine learning is great at identifying unknown functions from underlying features. The features we had so far are largely based on violations which were defined based on Java guidelines. All the checks parse the code, analyze the abstract syntax tree and log an error if a condition is met. A treasure of information can also be extracted from checks which store some absolute values (for example, the number of TODO comments can point us to an error where the team would have preferred more complete code). We analyzed all the checks in Checkstyle and created more than 150 absolute value features from the checks for our dataset. We experimented with these features and only a few were used in our analysis.

Sample Feature   | Definition
CovariantEquals  | Checks that if a class defines a covariant method equals, then it defines method equals(java.lang.Object).
EmptyBlock       | Checks for empty blocks but does not validate sequential blocks.
JavadocVariable  | Checks that all packages have a package documentation.
MethodCount      | Checks the number of methods declared in each type declaration by access modifier or total count.
MutableException | Ensures that exceptions (defined as any class name conforming to some regular expression) are immutable.
ParameterName    | Checks that parameter names conform to a format specified by the format property.
TodoComment      | A check for TODO comments.

Table 3: 7 of the 160 features that were fetched for each code file using Checkstyle
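The Violation Frequency bullet above describes a singleton that accumulates violation counts keyed by change id, patch set number and file name. A stripped-down sketch of that idea is given below; it is independent of the forked Checkstyle classes, whose internals are not reproduced here, and the class and method names are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal stand-in for the centralized data collector described above:
 * it counts how often each Checkstyle violation fires per
 * (change id, patch set, file name) composite key.
 */
public final class ViolationCollector {

    private static final ViolationCollector INSTANCE = new ViolationCollector();

    // composite key -> (violation name -> frequency)
    private final Map<String, Map<String, Integer>> frequencies = new ConcurrentHashMap<>();

    private ViolationCollector() {}

    public static ViolationCollector getInstance() {
        return INSTANCE;
    }

    /** Called whenever a check such as EmptyBlock or TodoComment reports an event. */
    public void record(String changeId, int patchSet, String fileName, String violationName) {
        String key = changeId + ":" + patchSet + ":" + fileName;
        frequencies.computeIfAbsent(key, k -> new ConcurrentHashMap<>())
                   .merge(violationName, 1, Integer::sum);
    }

    /** Feature row for one file revision, later written to the feature table. */
    public Map<String, Integer> featuresFor(String changeId, int patchSet, String fileName) {
        return frequencies.getOrDefault(changeId + ":" + patchSet + ":" + fileName, Map.of());
    }
}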
3.2.3 Smart Code-Review Engine

After the feature engineering was complete, we moved on to the creation of the smart code review engine. The aim of this code-review theme prediction engine is to automatically assign an error theme label to a given piece of code. After classifying each Java file into error themes based on comments and extracting 160 features for each of these files, we had a dataset with 925 unique data points, each of which has 160 features and an error theme label. This data-set is now reasonably good enough to train a model for the error theme prediction engine and for reviewer identification.

While we tested various models, we realized that the results of the different models were going haywire. We soon assessed that this is because different features had different ranges of values. For example, the feature WhiteSpaceAroundCheck had values in the range 30-50, whereas the feature TodoComment had values in the range 0-10. Thus, it was necessary to perform normalization across all the features.
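As an illustration, a simple min-max scaling of every feature column to the [0, 1] range (one possible normalization scheme; the paper does not state exactly which one was applied) can be written as:

public class FeatureScaling {

    /**
     * Min-max normalizes each feature column of a [rows][features] matrix to [0, 1],
     * so that e.g. WhiteSpaceAroundCheck (30-50) and TodoComment (0-10)
     * end up on a comparable scale.
     */
    public static double[][] minMaxNormalize(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[][] scaled = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int r = 0; r < rows; r++) {
                min = Math.min(min, data[r][c]);
                max = Math.max(max, data[r][c]);
            }
            double range = max - min;
            for (int r = 0; r < rows; r++) {
                // Constant columns are mapped to 0 to avoid division by zero.
                scaled[r][c] = range == 0 ? 0.0 : (data[r][c] - min) / range;
            }
        }
        return scaled;
    }
}

Z-score standardization would work equally well here; the important point is only that all 160 features end up on comparable scales before training.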
After performing the normalization of all features, we experimented with different models: Naive Bayes, Logistic Regression, Random Forest and K-Nearest Neighbours. The results of the experiments are presented in Table 4.

Classifier          | Data Split        | Accuracy
Logistic Regression | 80 Train, 20 Test | 45.94
Random Forest       | 80 Train, 20 Test | 76.75
KNN                 | 80 Train, 20 Test | 32.97
Logistic Regression | 70 Train, 30 Test | 47.65
Random Forest       | 70 Train, 30 Test | 79.78
KNN                 | 70 Train, 30 Test | 35.01
Logistic Regression | 60 Train, 40 Test | 47.56
Random Forest       | 60 Train, 40 Test | 77.02
KNN                 | 60 Train, 40 Test | 36.21

Table 4: Accuracy for classifiers on different data splits

For building our final model, we chose the Random Forest classifier, which is an ensemble algorithm. A random forest classifier creates a set of decision trees from random subsets of the training set. It then combines the votes from the different decision trees to decide the final class of the test object. We used an R package for Random Forest to generate the model. The inbuilt algorithm uses 501 decision trees. Random Forest has defaults for both the number of trees ("ntree") and the number of variables tried at each split ("mtry"). The default configuration of Random Forest for "mtry" is quite sensible, so there is no need to interfere with it. However, we can change the value of ntree. We used 500 as the value of ntree.
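The classifier itself was trained with an R random forest package, as described above. Purely as an illustration in Java (the language of the rest of the paper's code snippets), a roughly equivalent training and evaluation setup could be written with the Weka 3.8 library; the ARFF file name and the use of Weka are assumptions of this sketch, not the toolchain used in the project.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ErrorThemeModel {

    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the 160 features plus the error-theme label.
        Instances data = DataSource.read("error-theme-features.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42));

        // 80/20 train-test split, mirroring one of the splits in Table 4.
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        RandomForest forest = new RandomForest();
        forest.setNumIterations(500);   // 500 trees, analogous to ntree = 500 in R
        forest.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(forest, test);
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}

Varying the split sizes in this setup corresponds to the different rows of Table 4.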
3.2.4 Smart Reviewer Recommendation Engine

We also found the features created above to be very useful in identifying the ideal reviewer for a change. A reviewer is termed an ideal reviewer if they can help the author with their code reviews and can give useful and actionable comments. For the purpose of this analysis, we chose one of the larger projects on GerritHub in terms of the Java files uploaded for review. midonet/midonet is the project we experimented with; it had a large number of files (678) and around 20 reviewers who commented frequently. We chose a single project for our analysis because authors were largely bound to one project; there were only a few cases where authors were working on different projects. However, it was very difficult to link useful comments to the effect they had on the code change. We saw that in many cases, certain reviewers who discussed and elaborated on their comments in code reviews had more useful information and guidelines to offer. These reviewers were assumed to be the ideal reviewers for the code review for the sake of our experiments.

Our feature set remained the same for the files, but this analysis led to the creation of a new target variable, the author id. The reviewer who had commented on a code review the maximum number of times was labelled as the ideal reviewer. In certain cases, we observed that several reviewers commented an equal number of times. Therefore, the size of their comments was used to break ties.

Once this data-set was created, we evaluated it using different classifiers and noted the results, as seen in Table 5. It was again found that Random Forest gave the highest accuracy.

Classifier          | Accuracy
Logistic Regression | 48.7
Random Forest       | 62.06
KNN                 | 33.01

Table 5: Reviewer recommendation accuracy by classifier
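The labelling rule described above, where the reviewer with the most comments wins and total comment size breaks ties, can be sketched as follows. The FileComment record and the method names are illustrative assumptions rather than the project's actual code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical view of one reviewer comment in a file's review history.
record FileComment(String reviewerId, String message) {}

public class IdealReviewerLabeler {

    /** Picks the target label (reviewer id) for one file's feature row. */
    public static String idealReviewer(List<FileComment> comments) {
        Map<String, Integer> count = new HashMap<>();
        Map<String, Integer> length = new HashMap<>();
        for (FileComment c : comments) {
            count.merge(c.reviewerId(), 1, Integer::sum);
            length.merge(c.reviewerId(), c.message().length(), Integer::sum);
        }
        String best = null;
        for (String reviewer : count.keySet()) {
            if (best == null
                    || count.get(reviewer) > count.get(best)
                    // Tie: prefer the reviewer who wrote more (longer) comments.
                    || (count.get(reviewer).equals(count.get(best))
                        && length.get(reviewer) > length.get(best))) {
                best = reviewer;
            }
        }
        return best;
    }
}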
4. RESULTS

After performing the above methodology, we were able to get many interesting results and insights into the code review process.

4.0.1 Prediction Engine Performance

For our dataset, we found Random Forest to be the best classifier. We looked deeper into this and found that it performed good implicit feature selection. We understand that it worked for our data because we had a list of 160 features and not all of them were equally important. By using Random Forest, we got a good indication of feature importance. Also, random forests did not require much tweaking and fiddling to generate a decent model, since they were not sensitive to certain hyper-parameters. Another reason for the good performance of Random Forest was that it used many decision trees to come up with its prediction. Our labels were created using rule-based categorization, and this behaviour matched the underlying structure of decision tree prediction. Since individual decision trees are highly influenced by small and biased data-sets, a group of decision trees can serve a better purpose and can better aggregate the result. We used 500 of these different trees with Random Forest, combined the results, and trained the model to be insensitive to outliers and biased data.

We also experimented with the Naive Bayes classifier, and found that for any split of the data, it gave the worst accuracy, usually between 15 and 20 percent. We looked deeper into this and found that our feature selection was behind this behaviour. Assume there are two types of errors in a source code file, AvoidStarImport and AvoidStaticImport; if the code has a lot of import-related errors, then there is a higher chance that both of these error counts will be high. This means the occurrences of these two errors are dependent on each other. The Naive Bayes classifier assumes that the features are independent of each other, and therefore it is unable to capture the interdependence of features and hence cannot learn and predict as well.

Usually in code reviews, there are a bunch of errors that get pointed out in the code, not just one. Our system does not perform multi-class, multi-label classification yet; it only outputs one error theme, which is the topmost class. However, if it were scaled to do multi-class, multi-label classification, our system would return much better results, because it could predict the top n classes regardless of their order. This means it would return a set consisting of all the error themes that could be identified in a given piece of code, and not just the topmost error theme. Also, an important point to note is that the present system can tolerate false positives but not false negatives.

The reviewer recommendation system, however, could not produce very significant results, as seen in Table 5. This can be because our mechanism for selecting the ideal reviewer might be incomplete. We took into consideration the size of the contextual information written in the comment; however, the length of the comment cannot solely define whether a reviewer is better or not. We may have to use advanced natural language processing techniques in order to achieve a higher accuracy for recommendations. However, this looks like a harder problem to solve. Selecting top reviewers for a particular piece of code solely on the basis of previous data is insufficient; we need to add project contextual information as well. People who are well versed with the project might be able to offer better guidelines to the author, since they have also dealt with similar issues in the code when they worked on their part of the code base. Therefore, more research is needed to answer this research question. However, our solution lays a baseline for how to approach this problem.

4.0.2 CDBIR Algorithm

As discussed in the previous section, the reviewer recommendation can be improved by adding project contextual information to the random forest classifier. Based on our observations and research, we created a Contextual Decay Based Identification of Reviewer (CDBIR) algorithm that can be integrated with the existing random forest classification. This can theoretically boost the results of the classifier, because the prediction will not only be based on the previous comments by reviewers but also on the amount of contextual information each reviewer has on the code posted for review.

Result: Best reviewer for the code
Changes := 80% of the latest merged changes;
Twos := Map of a change to the list of people who gave +2;
Ones := Map of a change to the list of people who gave +1;
Comments := Map of a change to the list of comments;
Tree := an ADT of all the files and folders in the latest VCS, with all files and folders as nodes;
Scores := 0 array at each node;
for change : Changes do
    for node : files in change do
        Score := Score + CalcScore(ChangeNum, Twos, Ones, Comments)
    end
end

Algorithm 1: Initializes the scores
The algorithm works in two phases. In the first phase, a score is calculated for each file and folder using the latest X% of merged commits (X can be experimented with). While calculating the score, we can use a decay algorithm to give more score to recently merged changes, as developers who commented recently have more context than others. Different parameters, such as the weight of a +2, the weight of a +1 and the weight of each comment, can be tweaked to provide better results. Once a score is calculated at the file level, the best reviewer or a list of reviewers can be identified with this algorithm, and the result can be merged with the result of the random forest classifier.

Score := [0, 0, 0, ..., 0];
for node : files in newChange do
    Score := Score + Scores(node)
end
return Author with max score

Algorithm 2: Best reviewer for the code
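As a purely illustrative sketch (the algorithm was not fully implemented in this project, as noted below), the two phases could be rendered in Java as follows. The vote and comment weights, the exponential decay factor and the data-access record are assumptions of this sketch, exactly the kind of tunable parameters the description above mentions.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical view of a merged change: index 0 is the most recent change.
record MergedChange(List<String> files,
                    List<String> plusTwoVoters,
                    List<String> plusOneVoters,
                    Map<String, Integer> commentCountByReviewer) {}

public class Cdbir {

    // Tunable weights for +2 votes, +1 votes and comments (assumed values).
    private static final double W_PLUS_TWO = 3.0, W_PLUS_ONE = 1.0, W_COMMENT = 0.5;

    /** Phase 1: per-file, per-reviewer scores with exponential decay on change age. */
    public static Map<String, Map<String, Double>> initScores(List<MergedChange> recentChanges) {
        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (int age = 0; age < recentChanges.size(); age++) {
            MergedChange change = recentChanges.get(age);
            double decay = Math.pow(0.9, age);  // newer changes carry more context
            for (String file : change.files()) {
                Map<String, Double> perReviewer =
                        scores.computeIfAbsent(file, f -> new HashMap<>());
                change.plusTwoVoters().forEach(r ->
                        perReviewer.merge(r, W_PLUS_TWO * decay, Double::sum));
                change.plusOneVoters().forEach(r ->
                        perReviewer.merge(r, W_PLUS_ONE * decay, Double::sum));
                change.commentCountByReviewer().forEach((r, n) ->
                        perReviewer.merge(r, W_COMMENT * n * decay, Double::sum));
            }
        }
        return scores;
    }

    /** Phase 2: the best reviewer for a new change is the one with the highest summed score. */
    public static String bestReviewer(List<String> newChangeFiles,
                                      Map<String, Map<String, Double>> scores) {
        Map<String, Double> total = new HashMap<>();
        for (String file : newChangeFiles) {
            scores.getOrDefault(file, Map.of())
                  .forEach((reviewer, s) -> total.merge(reviewer, s, Double::sum));
        }
        return total.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}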
The reason our algorithm for identifying the best reviewer for a new piece of code did not give good results is the lack of file/folder contextual information in the feature translation. Adding this information will be key to the performance of any system making such predictions.

We could not fully implement the algorithm because of the limitations of the data and the underlying constraints on the availability of contextual information. We needed more visibility into each project's structure, and since the analysis was performed on a data dump of open source repositories with random projects, we could not get enough contextual information from all the projects to test the effectiveness of the algorithm on top of the random forest classifier.

5. THREATS TO VALIDITY

5.1 Identifying Error themes

In the error theme identification, our rule-based comment theme segmentation might not be enough to completely generalize the theme tagging process. In such a case, our error theme labelling might not be completely robust, and that can have an effect on the reported numbers. Also, the data set used was very small; our data-to-feature ratio was about 5. This means that the data was not enough to completely learn such a large number of features, and given more data we would have been able to better validate our findings. Finally, for the sake of the analysis, we never took the project context information of the files into consideration. This context is crucial in understanding the kinds of errors an uploaded file may contain. Files related to the same project might end up having the same errors, or files that reflect similar functionality in a project may have similar types of errors. These were qualitative features and were not fed into the classifiers.

5.2 Identifying ideal code reviewer

Our biggest assumption while identifying code reviewers was that a person who has reviewed similarly complex code in the past is the best person to review such code in the future. Since different teams have different ideas about how to do code reviews, we proceeded with this approach because it will work for the majority of open source repositories.

Developers join and leave teams all the time. Our algorithm may erroneously recommend a person who has left the team. This can be solved with two approaches. The first is to remove the data of that user from the system and train again. The second is to provide a top-n recommendation of code reviewers, so that there is a fallback if the first reviewer is not available.

6. RELATED WORK

As code review data-sets became publicly available via OSS projects, several studies were conducted to extract and analyze these data-sets.

6.1 Statistical Analysis

McIntosh et al. [6] performed an extensive analysis of modern code review practices and provided insights regarding the relationship between post-release defects, code review coverage, code review participation and code reviewer expertise. Mukadam et al. [7] examined the general structure of the Android project's code review data hosted on Gerrit. Rigby et al. [10, 11, 9] studied many review data projects and tried to analyze how much time reviews take, the number of people involved, and how effective reviews are. They also identified several guidelines that developers need to follow in order to make their patches accepted by reviewers. Baysal et al. [1] investigated the main factors that can affect review time and patch acceptance. Their experiments emphasized that the most important factors were the affiliation of the patch writer and the level of participation within the project. In our project, we too have tried to perform a similar analysis so that we can form the base to support why we want to automate the code review process.

6.2 Error theme prediction

Some studies have been conducted to help developers with the code review process by providing feedback about the review outcome. Jeong et al. [4] studied code review data from the Bugzilla system and attempted to predict the review outcome. Hellendoorn et al. [3] used language models to compute how similar a CR is to previous CRs. They then predicted whether the CR would be approved based on the review outcomes of similar CRs. Mantyla et al. [5] classified the defects of nine industrial and 23 student code reviews, detecting 388 and 371 defects respectively. Their research classified defects into evolvability and functional defects and suggested that code reviews are better for finding evolvability defects. Czerwonka et al. [2] performed an analysis of code review repositories to identify the error themes that occurred most frequently. The error themes created by them were taken as the base for our research. Our research is relevant because we are trying to identify the error theme for a given piece of code based on the history of comments received by the same type of files previously in the repository.

6.3 Recommending peer reviewers

There has also been some work done on the reviewer selection process. Ouni et al. [8] implemented a genetic algorithm-driven approach to identify the most appropriate peer reviewers, and they achieved 59 percent precision and 74 percent recall for accurately recommending reviewers. Also, Thongtanunam et al. [12] investigated a file location-based code-reviewer recommendation approach, leveraging
the file path similarity of previously reviewed file paths to recommend an appropriate code reviewer. For building our recommendation system, we took inspiration from Zanjani et al. [15], who were able to leverage information in previously completed reviews to suggest reviewers.

7. FUTURE DIRECTIONS

7.1 Identifying Error themes

A lot of code health check tools, e.g., Checkstyle, have a ton of features, and most teams run the tool out of the box. Most of the comments that we have observed could be automatically caught with the right configuration. But a team's opinions and ideas can change over time, and the configuration of such tools might not be kept up to date. The features we used from Checkstyle are great for identifying individual errors, but they do not identify complex errors.

7.2 Identifying reviewers

Identifying the right person to review the code is a subjective discussion; different teams do it differently. We tried to identify a reviewer based on the features of the code. The accuracy is close to 60%. The downside is that our approach will not work well when there is not much data to learn patterns from, and it misses the file context information. A future extension to our work can make use of features from the file contribution information to improve the accuracy. This, coupled with the automatic detection of error themes from a Gerrit change, would save significant time for everyone.

8. ACKNOWLEDGEMENTS

The project relied heavily on a modified version of Checkstyle, an open source tool, and would not have been realized without it. The computing resources were provided by Professor Meiyappan Nagappan as part of a grant from Microsoft Azure, and we are thankful for that.

9. CONCLUSION

Source code contributions to OSS projects are mostly evaluated through code reviews before being accepted and merged into the main development line. We can therefore say that the code review process is essential in maintaining the standards of software development in projects. However, as we can see in the analysis presented, the code review process is time consuming both for the developers and for project completion. Most of the time, code reviews create a bottleneck for developers trying to efficiently finish the development of a particular feature. In this paper, we have shown that combining information from file violations and peer comment behaviour can create a system that efficiently predicts the kinds of error themes that may get pointed out in the code review of a file. The authors can thus fix those error themes and then push a better version for review. This will eventually reduce the number of revisions, because the system will act as a warning system so that better and less error-prone code is submitted for review; such code will have a higher chance of getting a ship-it from the reviewers, or the reviewers will take less time to approve it. The underlying prediction model was able to identify error themes with an accuracy of about 75%. Our system is also able to recommend to the author a list of potential peer reviewers who can review the code faster as well as better. Such a system will potentially save a lot of time for the authors and reviewers and thus have a positive impact on software productivity, as the authors and reviewers will have more time to work on feature development. The system that we have developed is a proof of concept for now, but we are confident that if it is scaled further, it will be able to make the code review process much more efficient.

10. REFERENCES

[1] O. Baysal, O. Kononenko, R. Holmes, and M. W. Godfrey. Investigating technical and non-technical factors influencing modern code review. Empirical Softw. Engg., 21(3):932–959, June 2016.
[2] J. Czerwonka, M. Greiler, and J. Tilford. Code reviews do not find bugs: How the current code review best practice slows us down. In Proceedings of the 37th International Conference on Software Engineering - Volume 2, ICSE '15, pages 27–28, Piscataway, NJ, USA, 2015. IEEE Press.
[3] V. J. Hellendoorn, P. T. Devanbu, and A. Bacchelli. Will they like this? Evaluating code contributions with language models. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pages 157–167, May 2015.
[4] G. Jeong, S. Kim, T. Zimmermann, and K. Yi. Improving code review by predicting reviewers and acceptance of patches. 2009.
[5] M. V. Mantyla and C. Lassenius. What types of defects are really discovered in code reviews? IEEE Trans. Softw. Eng., 35(3):430–448, May 2009.
[6] S. Mcintosh, Y. Kamei, B. Adams, and A. E. Hassan. An empirical study of the impact of modern code review practices on software quality. Empirical Softw. Engg., 21(5):2146–2189, Oct. 2016.
[7] M. Mukadam, C. Bird, and P. C. Rigby. Gerrit software code review data from Android. In 2013 10th Working Conference on Mining Software Repositories (MSR), pages 45–48, May 2013.
[8] A. Ouni, R. G. Kula, and K. Inoue. Search-based peer reviewers recommendation in modern code review. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 367–377, Oct 2016.
[9] P. C. Rigby. A preliminary examination of code review processes in open source projects. 2005.
[10] P. C. Rigby and C. Bird. Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 202–212, New York, NY, USA, 2013. ACM.
[11] P. C. Rigby, D. M. German, L. Cowen, and M.-A. Storey. Peer review on open-source software projects: Parameters, statistical models, and theory. ACM Trans. Softw. Eng. Methodol., 23(4):35:1–35:33, Sept. 2014.
[12] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K. Matsumoto. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and
Reengineering (SANER), pages 141–150, March 2015.
[13] V. V and P. Jalote. List of common bugs and programming practices to avoid them, 2005.
[14] X. Yang, R. G. Kula, N. Yoshida, and H. Iida. Mining the modern code review repositories: A dataset of people, process and product. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 460–463, 2016.
[15] M. B. Zanjani, H. Kagdi, and C. Bird. Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering, 42(6):530–543, June 2016.
