
RESEARCH METHODS IN COMPUTER SCIENCE

Research Design: An Empirical Approach

Nordin Abu Bakar, 2011

Introduction

Computer science, like other scientific disciplines, needs specific methods and tests to justify the results produced in research. A great deal of research in CS leaves this statistical part out of the discussion and puts the demonstration of a prototype forward as the ultimate justification of the findings. For the research to be widely accepted, CS researchers must accompany their results with justification showing that the numbers are valid and correct. There are a number of approaches that can be used to carry out research in CS; the empirical approach is one of them. The empirical approach uses statistics as one of the ways to analyse findings or test hypotheses. Analysing data with statistics requires researchers to understand some basic notions in statistics, which this chapter is about to explore.

There are several reasons why statistical analysis is relevant in computer science:

- It explains the results on a common platform that everyone can understand.
- It provides an explanation of the situation being studied, giving a clearer picture that may not have been grasped before.
- It measures whether the research has been successfully executed.
- It answers the research questions.
- It evaluates whether there is more to the topic currently being investigated or whether it is a dead end and the researcher should find another way to understand the situation.

That statistical analysis is too hard to work with is just another myth; if properly handled and patiently studied, it will benefit the researcher in the long run. The knowledge gained after each analysis remains with the researcher for tackling the next problem. It grows, and after each research project a researcher will feel more confident and get more out of the process.


Level of Measurement

[Figure: Levels of measurement and the statistical tests appropriate to each - nominal scale (chi-square test), ordinal scale (non-parametric tests), interval scale (t-test/F-test), ratio scale (all tests).]


Nominal Scale

The word nominal means "to name". In statistics, numbers are assigned to variables only to classify or categorize them, such as 1 = male, 2 = female, so that the data set can be easily manipulated and can aid data analysis. The only arithmetic that is relevant for such data is counting. Statistical usage is very limited, and the scale is normally good for keeping track of people, objects and events. The common statistical test that can be performed on this kind of data is the chi-square test.

Ordinal Scale

Numbers or values are assigned to objects or events to establish rank or order, such as 1st, 2nd, 3rd or 4th positions. Intervals of the scale are not equal; that is, adjacent ranks need not differ by equal amounts, and no more precise comparisons are possible. The median is an appropriate measure of central tendency, and a percentile or quartile is used for measuring dispersion. Rank-order correlations are possible. The statistical tests that are possible for this data are non-parametric tests. This scale is commonly used in qualitative research.

Ordinal-level question: In which category was your income last year? 1. Above RM100K 2. RM50K-RM100K 3. Below RM50K


Interval Scale

The numbers or values assigned to objects or events can be categorized, ordered and assumed to have equal distances between scale values. Typical examples are test scores or degrees of temperature, such as a Fahrenheit temperature of 72 degrees or a test score in the range 0-100. There is no absolute zero or unique origin; only an arbitrary zero is available, and hence there is no capacity to measure the complete absence of a trait or characteristic. This type of data is more powerful than the ordinal scale due to the equality of intervals. The sample mean is an appropriate measure of central tendency, and the standard deviation (SD) is widely used for dispersion. The common statistical tests for this data are the t-test and F-test for significance.

Ratio Scale

The numbers in this data set represent objects or events which can be categorized, ordered and assumed to have equal distances between scale values, and which have a real zero point. The values can be used for all statistical tests that conform with the requirements of the particular test. This is the highest level of measurement; all mathematical operations and statistical techniques can be applied, and all manipulations that are possible with real numbers can be carried out.

Ratio-level question: What was your income last year?


Figure 1: Steps in the Planning Phase

Population

Population refers to the entire group of subjects to be studied. The size of the population is important to determine because of the need for inference later on. The validity of the results as a reference to the whole population depends on how much data are collected to infer about the entire population.

Sampling

Sampling is a process of drawing some elements from the population and analysing them. Since the sampling unit is a subgroup of the objects under study, it must reflect the whole group as much as possible. Failing to do so will jeopardize the data, the results and the findings. In the sampling process, we need to define the sampling frame and the sampling methods. The sampling frame represents the elements in a population from which a sample is drawn. It could be membership names, staff directories, registered students, zakat recipients, licensed traders, and the like. With a clearly defined group, the researcher can determine who should be included and who should be excluded, which minimizes the amount of error in the data. Once the sampling frame is determined, the researcher can select an appropriate sampling method.

How to determine the sample size

The sample size must represent the entire population of the subjects being studied. In order to avoid any error due to misrepresentation, the size of a simple random sample must be determined by specifying:

- the level of confidence
- the acceptable amount of error
- the value of the SD or proportion

A sketch of the resulting calculation follows this list.
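Under the usual large-sample assumption (our own illustration, not stated in the text), the minimum size of a simple random sample for estimating a mean follows from n = (z * SD / E)^2, where z corresponds to the chosen confidence level and E is the acceptable error. The C++ sketch below shows the calculation; the function name and the example figures are hypothetical.

    #include <cmath>
    #include <iostream>

    // Minimum sample size for estimating a mean: n = (z * sigma / E)^2,
    // where z matches the confidence level, sigma is the (estimated) SD
    // and E is the acceptable amount of error.
    int sampleSizeForMean(double z, double sigma, double E) {
        return static_cast<int>(std::ceil(std::pow(z * sigma / E, 2.0)));
    }

    int main() {
        // 95% confidence (z = 1.96), SD estimated at 12, acceptable error of 2 units.
        std::cout << sampleSizeForMean(1.96, 12.0, 2.0) << "\n";  // prints 139
        return 0;
    }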

Research Design

Research methodologies are commonly characterized by their research designs. Each research design specifies the method used in the experiment to collect the data. A research design is defined as a plan for conducting research, which usually includes a specification of the elements to be examined and the methods to be used.


The research design is selected as the most suitable and feasible method for testing the hypotheses or answering the research questions. The diagram in Figure 2 shows the different research designs.

Figure 2: Research Designs

True Experimental Design

In an experimental design, a specification for a research study is laid out to answer a specific research question such as: does variable A cause variable B to increase in value? The plan must include:

- the methods of selecting and assigning subjects
- the number and types of variables


The main purpose of the experiment is to apply some controls and study the cause-and-effect relationships among the variables. The variables known as independent variables are assumed to cause changes in those known as dependent variables. The experiment will then show, with exact values, whether or not the assumption is correct and acceptable. Control mechanisms in experimental design play an important role in the validity as well as the reliability of the results. In order to make sure that the results are valid, the researcher must be able to:

- collect a good amount of data from at least two comparison groups;
- apply random selection;
- manipulate the independent variable to apply different treatments.

The threats to validity in experimental design are:

1. Events which occur between the first and second measurements.
2. Changes in the subjects during the course of the experiment.
3. Subjects changing their opinion after the first measurement; the second measurement might differ because of this knowledge.
4. The researcher or research assistant changing the way they execute the measurement.

Factorial Design

This research design is a form of true experimental design used whenever the research has more than one independent variable.

Other true experimental designs include:
- Solomon four-group design
- Pretest-posttest design
- Posttest-only design


Figure 3: Steps in the Action Phase


How do you do statistical analysis?

The process actually starts at the very beginning of a research project, not after all the data have been collected. At the proposal stage, statistical analysis should be considered in as much detail as possible, because the data environment depends on the statistical analysis and not the other way around. The researcher should decide what kind of statistical analysis to adopt based on the nature of the research and the area being investigated. Basically, a few items can be considered when determining the approach for statistical analysis:

- type of research
- type of data
- hypothesis

Type of research

[Figure 4: Types of Research - research is either qualitative, leading to qualitative analysis, or quantitative, leading to statistical analysis, which in turn is descriptive or inferential.]


Type of data

[Figure 5: Types of Data - non-quantitative data are profiled with percentages, the mode, tables and charts; their associations are examined with cross-tabulations and chi-square; and hypotheses are tested with non-parametric methods. Quantitative data are profiled with the mean, median, minimum, maximum and standard deviation; their associations are examined with correlation analysis; and hypotheses are tested with parametric methods.]

Mean, Median and Mode

A measure of central tendency is pivotal in statistical analysis for giving a valid generalization of the data. When there is a collection of numbers, a natural step is to pick one number that will represent the whole group. A question may arise as to why we have picked such a number, but in statistics there is a reason for that particular choice. A layman might cite "the average" as the reason for taking that number, but average can mean several things: the mean, the median or the mode. The mean is basically the arithmetic average: the total of the values divided by the count of the values. Say we have 30 examination marks from our class; the mean of the examination marks is the total of the marks divided by the number of students taking the examination. For example:


K = 45, 30, 67, 89, 55, 66, 75, 49, 50, 85
Mean = Total / Count
Mean_K = 611 / 10 = 61.1

The mean represents the total picture of the group or list of numbers. It takes into account all the values in the group, which justifies using the mean to represent the group. However, the mean works well only if the values are well distributed; if there is an odd value (an outlier), the mean can be well off. The computation can also be a constraint if the data set is large. The median is the middle number in a list. When the numbers in a list are placed in order, the median is the middle number if the count of numbers is odd, or the value halfway between the two middle numbers if the count is even. For example:

K = 45, 30, 67, 89, 55, 66, 75, 49, 50, 85 (in order: 30, 45, 49, 50, 55, 66, 67, 75, 85, 89)
Median = (Mid1 + Mid2) / 2
Median_K = (55 + 66) / 2 = 60.5

Compared to the mean, the median resolves the outlier problem: an odd value does not affect the derivation of the median. The process is also simple and easy to carry out, as it involves little numerical computation. But because there is little numerical computation, the value taken as the median is not precise and does not tell much about the data. The mode is the number that occurs most frequently in the list. For example:

K = 2, 2, 3, 4, 5, 4, 5, 4, 4, 4
Mode = the value that occurs most often
Mode_K = 4

The mode is quite simple and straightforward to obtain. Due to its simplicity, however, the mode is very raw and can be questionable, especially when there is not much difference in the frequencies of the data. A short code sketch of the three measures follows.
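As an illustration only (not from the original text; the function names are ours), the following C++ sketch computes the three measures for the lists used above:

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <vector>

    // Arithmetic mean: total of the values divided by the count.
    double mean(const std::vector<double>& v) {
        double total = 0;
        for (double x : v) total += x;
        return total / v.size();
    }

    // Median: middle value of the sorted list, or the average of the
    // two middle values when the count is even.
    double median(std::vector<double> v) {
        std::sort(v.begin(), v.end());
        size_t n = v.size();
        return (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }

    // Mode: the value that occurs most frequently.
    double mode(const std::vector<double>& v) {
        std::map<double, int> freq;
        for (double x : v) ++freq[x];
        return std::max_element(freq.begin(), freq.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; })->first;
    }

    int main() {
        std::vector<double> K = {45, 30, 67, 89, 55, 66, 75, 49, 50, 85};
        std::cout << "mean = " << mean(K) << ", median = " << median(K) << "\n";  // 61.1, 60.5
        std::vector<double> K2 = {2, 2, 3, 4, 5, 4, 5, 4, 4, 4};
        std::cout << "mode = " << mode(K2) << "\n";  // 4
        return 0;
    }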


Standard Deviation

Standard deviation (SD) is a measure of dispersion from the mean. After the mean has been determined, the distance of each value from the mean can be calculated, and the standard deviation summarizes these distances. If the SD is large, the data are widely distributed, with many values far away from the mean. On the other hand, if the SD is small, most values are very close to the mean. If the SD is zero, there is no dispersion and all values are the same. The SD also positions a value above or below the mean, which can be used to evaluate that particular value against the rest of the values in the data list. A short sketch of the computation follows.
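As a minimal sketch (assuming the sample formula with an n - 1 denominator, which the text does not specify), the SD can be computed as follows:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Sample standard deviation: square root of the average squared
    // distance from the mean, using an (n - 1) denominator.
    double standardDeviation(const std::vector<double>& v) {
        double total = 0;
        for (double x : v) total += x;
        double m = total / v.size();

        double sumSq = 0;
        for (double x : v) sumSq += (x - m) * (x - m);
        return std::sqrt(sumSq / (v.size() - 1));
    }

    int main() {
        std::vector<double> K = {45, 30, 67, 89, 55, 66, 75, 49, 50, 85};
        std::cout << "SD = " << standardDeviation(K) << "\n";
        return 0;
    }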

Correlation

When the objective of the research is to find relationships between variables, correlation analysis is inevitable. In computer science this research element is also present and very common; for example, a study might want to find the relationship between parameter A and parameter B in an application, between software errors and testing procedures or programmers' attitudes, or between machine architecture and speed of execution. If one variable scores as highly as the other variable, the relationship is referred to as a positive correlation. On the other hand, if one variable scores highly while the other scores low, there is a negative correlation. There are also cases where the relationship is scattered and does not present a cohesive trend; such a relationship is deemed a zero correlation.

In order to determine the type of correlation for a data set, a scatter graph will do the job, but for a more precise numerical value a statistical test must be performed. Pearson's correlation and Spearman's rho are two examples of statistical tests that facilitate correlation analysis. The value produced by such a test is referred to as the correlation coefficient, which lies in the range -1 to +1. A coefficient close to +1 indicates a strong positive correlation, a coefficient close to -1 a strong negative correlation, and the closer the coefficient is to 0, the weaker the correlation. Say that after the test the coefficient between X and Y is 0.8 (r = 0.8) and the coefficient between X and Z is 0.4 (r = 0.4).


The interpretation of the correlation coefficient is quite tricky and not as straightforward as it may seem. The correlation between X and Y corresponds to 64% (100 * r^2), meaning that 64% of the variation in Y is accounted for by variation in X; the correlation between X and Z corresponds to 16% (100 * r^2), meaning that 16% of the variation in Z is accounted for by variation in X. However, it is important to note that correlation analysis does not tell us whether X causes Y to change; it only indicates that correlated change exists. If one wants to know whether X causes Y, then causal research should be employed, in which specific experiments are carried out in the lab to find out whether X causes Y to change. This leads us to another interesting part of statistical analysis called hypothesis testing.

Correlation Analysis with SPSS

What is it?

Very often researchers want to understand a situation in which things differ. The need to explain the circumstances in a more rational manner leads us to find a reasonable way to analyse the data and draw a conclusion. Do students who are good in maths achieve a higher CGPA, and do those who are not good in maths get a lower CGPA? What we are trying to understand is whether the variable "good in maths" correlates with the variable "CGPA". If this were the case, we would say that there is a positive correlation between the variables. A positive correlation means that as the score on one variable increases, the corresponding score on the other variable increases as well (in SPSS, the coefficient will be positive). In the above context, if a student is good in maths, his or her CGPA will be higher as well.

There are also situations in which, as the score on one variable goes up, the score on the other variable goes down. This is referred to as a negative correlation (in SPSS, the value in the table will be negative). One example of such a correlation is between a person's weight and their health: as the weight goes up, the less healthy the person tends to be.


What is it for?

It measures the strength and direction of the linear relationship between a pair of variables. If there are more than two variables, multivariate analysis is needed.

How to use it?

Using SPSS, the steps to follow are:

1. Statistics -> Correlate -> Bivariate.
2. The corresponding dialogue box will appear; select the variables to correlate and move them to the Variables box.
3. Choose the appropriate correlation coefficient:
   - interval data -> Pearson
   - ordinal data -> Spearman
4. Select a one-tailed or two-tailed test:
   - one-tailed -> the direction of the relation is known
   - two-tailed -> the direction is unknown
5. Click OK.

Sample output is as follows:

                            Variable A   Variable B   Variable C
Pearson        Variable A      1.000        0.690*       0.840*
Correlation    Variable B      0.690*       1.000        0.750*
               Variable C      0.840*       0.750*       1.000
Sig.           Variable A        .          0.002        0.005
(2-tailed)     Variable B      0.002          .          0.000
               Variable C      0.005        0.000          .
N              Variable A       20           20           20
               Variable B       20           20           20
               Variable C       20           20           20

* Correlation is significant at the 0.01 level (2-tailed).
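For readers who want to verify such a coefficient outside SPSS, a minimal C++ sketch of the Pearson coefficient (our own illustration; the paired scores are hypothetical) is:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Pearson correlation coefficient between two equally long series:
    // r = covariance(x, y) / (SD(x) * SD(y)).
    double pearson(const std::vector<double>& x, const std::vector<double>& y) {
        size_t n = x.size();
        double mx = 0, my = 0;
        for (size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;

        double sxy = 0, sxx = 0, syy = 0;
        for (size_t i = 0; i < n; ++i) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / std::sqrt(sxx * syy);
    }

    int main() {
        // Hypothetical paired scores; r close to +1 indicates a strong positive correlation.
        std::vector<double> mathScore = {55, 60, 65, 70, 80, 85, 90};
        std::vector<double> cgpa      = {2.4, 2.6, 2.9, 3.0, 3.3, 3.5, 3.7};
        std::cout << "r = " << pearson(mathScore, cgpa) << "\n";
        return 0;
    }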

Testing the hypothesis

A hypothesis is an assumption one makes about an effect on certain variables. For example,

1. Parameter A is better than parameter B.
2. Selection sort performs faster than binary sort.
3. Cost affects software performance.

Using statistical techniques, the outcome of this process, whether the hypothesis is accepted or rejected, can be properly justified and supported. The steps for carrying out hypothesis testing are as follows:

1. Choose a null hypothesis. Make the opposite assumption about a vital variable of your study: if the study wants to prove A, choose B (the opposite of A) as the null hypothesis. The idea is that when B is rejected, A is statistically supported.
2. Choose an alternative hypothesis that can be accepted in case the null hypothesis (in step 1) is rejected. This is A (as in step 1); if B is rejected, A can be accepted even though there is no direct evidence that it is true.
3. Set the condition for rejecting the hypothesis and for not rejecting it.
4. Draw a random sample and select a statistical method.
5. Based on the test, choose to reject or not to reject the null hypothesis.
6. Accept the alternative hypothesis if the null hypothesis is rejected.


Hypothesis Testing

To describe the process of hypothesis testing, we feel that we cannot do better than follow the five-step method introduced by Neave (1976a), as it appears in Kanji (1999):

Step 1

Formulate the practical problem in terms of hypotheses. The focus should go into creating the alternative hypothesis, Ha, since this is the more important from a practical point of view. It should express the range of situations that we wish the test to be able to diagnose; in this sense a positive test can indicate that we should take action of some kind. Once this is fixed, it should be obvious whether we should carry out a one- or two-tailed test. The null hypothesis, H0, needs to be very simple and represents the status quo, i.e., that there is no difference between the processes being tested. It is basically a standard or control with which the evidence pointing to the alternative can be compared.

Step 2

Calculate a statistic T, a function purely of the data. All good test statistics should have two properties: (a) they should tend to behave differently when H0 is true than when Ha is true; and (b) their probability distribution should be calculable under the assumption that H0 is true. It is also desirable that tables of this probability distribution exist.

Step 3

Choose a critical region. One should decide on the kinds of values of T which will most strongly point to Ha being true rather than H0. A value of T lying in a suitably defined critical region will lead us to reject H0 in favour of Ha; if T lies outside the critical region, we do not reject H0. We should never conclude by accepting H0.

Step 4


Decide the size of the critical region. This involves specifying how great a risk we are prepared to run of coming to an incorrect conclusion.
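As a self-contained illustration of steps 2-4 (our own sketch, not from the original text), the code below computes a one-sample t statistic and compares it with a fixed critical value; the marks reuse the earlier example and the critical value 2.262 (two-tailed, 5% level, 9 degrees of freedom) comes from standard tables.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // One-sample t statistic: T = (sample mean - mu0) / (s / sqrt(n)).
    double tStatistic(const std::vector<double>& x, double mu0) {
        size_t n = x.size();
        double m = 0;
        for (double v : x) m += v;
        m /= n;

        double ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        double s = std::sqrt(ss / (n - 1));
        return (m - mu0) / (s / std::sqrt(static_cast<double>(n)));
    }

    int main() {
        // H0: the mean mark is 50; Ha: the mean mark differs from 50 (two-tailed).
        std::vector<double> marks = {45, 30, 67, 89, 55, 66, 75, 49, 50, 85};
        double T = tStatistic(marks, 50.0);
        double critical = 2.262;  // t(0.025, df = 9) from standard tables

        std::cout << "T = " << T << "\n";
        if (std::fabs(T) > critical)
            std::cout << "Reject H0 in favour of Ha\n";
        else
            std::cout << "Do not reject H0\n";
        return 0;
    }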


Chi-Square Test

What is it?

The data that have been collected need to be processed and analysed. If the data are non-quantitative, that is, not numerical but categorical criteria such as sex or having a headache, then the chi-square test can be used. "Is there a connection between these two criteria?" we may ask in the research. In statistics, this is called a measure of association: the research is looking into the association between two variables which are not numerical in nature.

What is it for?

- It measures the association between two variables.
- It indicates the level of distribution in the population.

How to use it?

The test can be used if a table of frequencies can be produced. Using SPSS, the following steps will derive the results.

1. Analyze -> Summarize -> Crosstabs.
2. SPSS will pop up a dialogue window; select the appropriate variables.
3. Click on Statistics and choose the appropriate test (chi-square).
4. Click OK.


The results may be as follows:

                                 Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square              43.617a      4          .000
Likelihood Ratio                46.826       4          .000
Linear-by-Linear Association    41.263       1          .000
N of Valid Cases                   250

a. 0 cells (.0%) have an expected count less than 5. The minimum expected count is 12.10.

Column 2 gives the value of the test statistic, column 3 states the degrees of freedom (df), and column 4 indicates whether the result is significant or not.

Take these values and compare the Pearson chi-square value against the critical value from the chi-square table with df = 4 at the 0.05 level of significance. If the computed value is greater than the one in the table (equivalently, if the significance value in the last column is below 0.05), the null hypothesis (H0) is rejected and the alternative hypothesis is supported. A sketch of the underlying computation follows.
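As an illustration only (not from the original text), the chi-square statistic can also be computed directly from a table of observed frequencies; the 2x2 counts below are hypothetical.

    #include <iostream>
    #include <vector>

    // Chi-square statistic for a contingency table of observed counts:
    // sum over cells of (observed - expected)^2 / expected, where
    // expected = (row total * column total) / grand total.
    double chiSquare(const std::vector<std::vector<double>>& obs) {
        size_t rows = obs.size(), cols = obs[0].size();
        std::vector<double> rowTot(rows, 0), colTot(cols, 0);
        double grand = 0;
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j) {
                rowTot[i] += obs[i][j];
                colTot[j] += obs[i][j];
                grand += obs[i][j];
            }

        double chi2 = 0;
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j) {
                double expected = rowTot[i] * colTot[j] / grand;
                chi2 += (obs[i][j] - expected) * (obs[i][j] - expected) / expected;
            }
        return chi2;  // compare against the chi-square table with (rows-1)*(cols-1) df
    }

    int main() {
        // Hypothetical counts: rows = sex (male, female), columns = headache (yes, no).
        std::vector<std::vector<double>> table = {{30, 70}, {45, 55}};
        std::cout << "chi-square = " << chiSquare(table) << "\n";
        return 0;
    }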


Confidence Interval

When the measurements have been collected from the experiments or tests, there is a set of numerical values that needs to be analysed. Some relevant questions are:

1. How certain can we be of the values?

2. If there are two sets of values, how certain can we be that the two sets are different?

The need to answer these questions leads us to more statistical analysis and its vital role in computer science research. Say the sample mean is 0.248. So what? What does it mean? Is it good or bad? There must be a way to justify this value and to attach some evaluation criteria to the number. How confident are we that the value reflects the true mean? In other words, we want a confidence interval that qualifies the value as follows: if a sample of 40 were drawn and the mean calculated, 95% of the time the mean would lie between a lower bound (lb) and an upper bound (ub), such that lb < Mk < ub. The bootstrap method is suitable for this process and is done in the following steps (a code sketch appears after the example below):

- Choose 1000 random samples (with replacement) of size 40 from our original 40 points.
- Take the mean of each sample.
- Sort the 1000 means and take the values at the 25th and 975th positions.

The lower bound is 0.2451 and the upper bound is 0.2505. Since 0.2451 < 0.248 < 0.2505, the calculated mean can be accepted as representative of the true mean of the population.
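A minimal sketch of this percentile procedure (our own illustration; the data array is a placeholder for the 40 actual measurements) is:

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <vector>

    // Percentile bootstrap: resample the data with replacement many times,
    // compute the mean of each resample, sort the means, and read the
    // bounds off the 25th and 975th of the 1000 sorted values.
    int main() {
        std::vector<double> data(40, 0.248);  // placeholder: the 40 measurements go here
        const int resamples = 1000;
        std::vector<double> means;

        std::srand(12345);
        for (int r = 0; r < resamples; ++r) {
            double total = 0;
            for (size_t i = 0; i < data.size(); ++i)
                total += data[std::rand() % data.size()];  // draw with replacement
            means.push_back(total / data.size());
        }

        std::sort(means.begin(), means.end());
        std::cout << "95% CI: [" << means[24] << ", " << means[974] << "]\n";
        return 0;
    }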


The Bootstrap Method

The bootstrap method is an attractive procedure for CS researchers, as it offers an escape from the usual complicated statistical procedures. It suits the CS environment nicely due to its flexibility in the sampling criteria: the samples of data used to find the confidence intervals may have distributions that depart from the traditional parametric distributions. Constraints in producing enough data for statistical procedures are very common in CS research, and many researchers have resorted to staying away from statistical analysis altogether. The bootstrap method gives an opportunity to produce statistically reliable analysis regardless of the form of the data's probability density; it makes no assumption about the data distribution. Probably the defining point of the bootstrap method is that the entire sampling distribution is estimated by relying on the fact that the sample's distribution is a good estimate of the population distribution. By contrast, traditional parametric inference depends on the assumption that the sample and the population are normally distributed.

The bootstrap method was initially proposed by Efron in 1979. He used Monte Carlo sampling to generate an empirical estimate of the sampling distribution. Monte Carlo sampling builds an estimate of the sampling distribution by randomly drawing a large number of samples of size k from a population and calculating, for each one, the associated value of the statistic. The relative frequency distribution of these values is an estimate of the sampling distribution for that statistic.

The procedure

The generic bootstrap method has the following basic ideas, as presented by Efron and Tibshirani (1994):

A bootstrap sample is a sample x* = (x1*, x2*, ..., xn*) obtained by drawing at random, with replacement, from the experimental sample x = (x1, x2, ..., xn), also designated the bootstrap population. Here, the asterisk denotes that x* is a randomized version, or resampling, of x, rather than a new group of actual data; the bootstrap sample consists of members of x. For each bootstrap procedure one should carry out a random resampling, sampling with replacement from the n elements of the experimental sample, which is employed as the parent population. The arithmetic mean of the i-th bootstrap sample is then given by equation 1; after a number m of resamplings, the arithmetic bootstrap mean is obtained by equation 2, with the standard deviation given by equation 3:

(1)  mean_i* = (x1* + x2* + ... + xn*) / n
(2)  mean_m* = (mean_1* + mean_2* + ... + mean_m*) / m
(3)  SD* = sqrt( ((mean_1* - mean_m*)^2 + ... + (mean_m* - mean_m*)^2) / (m - 1) )

The bootstrap probability distribution is the result of this resampling sequence. In practice, the bootstrap distribution is built from the Monte Carlo method with a number of repetitions, for a sufficiently large m. In this case, the bootstrap mean approximates the mean of the population and the distribution tends to a normal one (Manly, 1997). The convergence is guaranteed by the law of large numbers, because x1*, x2*, ..., xn* are nothing more than a sample of independent and identically distributed random variables.

An implementation of the bootstrap method in C++ is as follows; get_data(), compute_statistic() and compute_stats() are placeholders to be supplied by the researcher:

    #include <cstdlib>
    #include <ctime>

    const int N = 40;          // number of data points in the experimental sample
    const int RESAMPLES = 100; // number of bootstrap resamplings
    double X[N], Y[N], Z[RESAMPLES];

    void get_data();                            // put the variable of interest in X[]
    double compute_statistic(const double* y);  // the statistic whose precision is to be estimated
    void compute_stats(const double* z, int m); // summarize the bootstrap statistics stored in Z[]

    int main() {
        get_data();
        std::srand(static_cast<unsigned>(std::time(nullptr)));  // initialize the random number generator
        for (int i = 0; i < RESAMPLES; i++) {
            for (int j = 0; j < N; j++)
                Y[j] = X[std::rand() % N];   // select a random sample from X with replacement
            Z[i] = compute_statistic(Y);     // compute the statistic for the resample Y and store it in Z
        }
        compute_stats(Z, RESAMPLES);         // e.g. sort Z and read off percentile bounds
        return 0;
    }


Conclusion

Statistical analysis can enhance the findings produced in research. The important part is to make sure that the results are well understood, to assist in the evaluation of the whole research project. This is instrumental for CS researchers, so that the system or application produced as the outcome of the research is free from experimental flaws or software bugs. A strong justification of the parameters used, the methods chosen and the techniques implemented can safeguard the development stage, which may follow the research period, from errors.

The bootstrap method discussed in this chapter is a brave diversion from traditional parametric inference and has improved analysis in much CS research. The method works well in some circumstances but behaves badly in others: it is good for roughly normal distributions but tends to be problematic for heavily skewed ones, so it must be adopted with great care and a clear understanding of the data.


References

Bradley Efron (1979). "Bootstrap methods: Another look at the jackknife", The Annals of Statistics, 7, 1-26.

Bradley Efron (1981). "Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods", Biometrika, 68, 589-599.

Bradley Efron (1982). The jackknife, the bootstrap, and other resampling plans, In Society of Industrial and Applied Mathematics CBMS-NSF Monographs, 38.

P. Diaconis, Bradley Efron (1983), "Computer-intensive methods in statistics," Scientific American, May, 116-130.

Bradley Efron, Robert J. Tibshirani, (1993). An introduction to the bootstrap, New York: Chapman & Hall, software.

Davison, A. C. and Hinkley, D. V. (1997): Bootstrap Methods and their Applications, software.

Mooney, C Z & Duval, R D (1993). Bootstrapping. A Nonparametric Approach to Statistical Inference. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-095. Newbury Park, CA: Sage.

Simon, J. L. (1997). Resampling: The New Statistics.

Good, P. I. (2005). Resampling Methods: A Practical Guide to Data Analysis. Springer. ISBN 978-0-8176.
