
Increasing Assessment Reliability
By Alvin Wallace, PhD

Regarding the notion of increasing reliability, Miller (n.d.) wrote that the general convention in research has been prescribed by Nunnally and Bernstein (1994), who state that one should strive for reliability values of .70 or higher. Miller also notes, citing Gulliksen (1950), that reliability values increase as test length increases, yet he cautions that the problem with simply increasing the number of scale items in applied research is that respondents are less likely to participate and answer completely when confronted with the prospect of replying to a lengthy questionnaire. The advice given is that the best approach is to develop a scale that completely measures the construct of interest, yet does so in as parsimonious or economical a manner as possible.
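The trade-off Miller describes, between longer tests and respondent fatigue, is often quantified with the Spearman-Brown prophecy formula, which predicts how reliability changes as parallel items are added. The sketch below is a minimal illustration of that relationship, not drawn from any of the cited sources; the starting reliability of .60 and the length factors are hypothetical values chosen only to show the trend toward the .70 convention.

def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`
    (e.g., 2.0 doubles the number of parallel items)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical starting point: a scale with reliability .60.
base = 0.60
for factor in (1.0, 1.5, 2.0, 3.0):
    print(f"{factor:.1f}x items -> predicted reliability {spearman_brown(base, factor):.2f}")
# The predicted value climbs toward the .70 convention (about .69 at 1.5x, .75 at 2x),
# illustrating the trade-off with questionnaire length noted above.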

When comparing an actual certification exam with typical classroom assessments, factors such as the order and grouping of questions, the presentation of attribute lists to be rated, and the look and feel of the instructions might impact the error rate, as might the speed with which the computer operates, the use of graphics, and background colors and/or fonts. A crucial evaluation might also consider issues such as: 1) it is easier to click than to type a response; and 2) whether the testing mechanism uses radio buttons or drop-down menus. Radio buttons are best with five or fewer answers, and drop-down lists are better with more than five options. For a lengthy list, use short identifiers (i.e., abbreviations) and allow the user to type the abbreviation rather than scroll the list. For multiple responses, use check boxes and include an option for other responses. (DSS Research, 2000, pp. 1-10)

These are issues that the CCNA examinee faces, and they are incorporated in the actual exam format. Cisco Career Certification exams include the following test formats:

- Multiple-choice, single answer
- Multiple-choice, multiple answer
- Drag-and-drop
- Fill-in-the-blank
- Testlet
- Simlet
- Simulations

Cisco advises that before taking the exam, candidates should become familiar with how all exam types function, especially the testlet, simlet, and the simulation tool. Examples of each type are found at http://www.cisco.com/web/learning/wwtraining/certprog/training/cert_exam_tutorial.html.

Drissell (2007) has argued that an effect size of 1.0 corresponds to approximately a one-grade difference in elementary school, while an effect size of 1.5 is extremely rare in educational research. It follows that the relative complexity of the assessment task will probably affect the scores obtained relative to the proposed learning outcomes. It can also be argued that whether one wants to assess the reliability of test scores or the amount of measurement error in test scores, it is the statistic used that makes the difference, especially when accuracy and prediction strength are factors to consider. For example, Gronlund and Waugh (2009, p. 59) state that the reliability of test scores is typically reported by means of a reliability coefficient or the standard error of measurement derived from it. So, in the assessment of complex tasks such as those found in the CCNA exam, one might also use Cronbach's alpha, which measures how well a set of items (or variables) measures a single unidimensional latent construct; when the data have a multidimensional structure, Cronbach's alpha will generally be low. It is, in fact, a coefficient of reliability (or consistency), according to SPSS.
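To make the alpha statistic concrete, the following sketch computes Cronbach's alpha directly from its standard formula for an item-by-respondent score matrix. It is a generic illustration rather than code from any cited source, and the five-respondent, four-item Likert matrix is hypothetical.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = respondents, columns = items.
    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of total score)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 respondents x 4 items (e.g., 1-5 Likert responses).
data = np.array([[4, 5, 4, 4],
                 [3, 3, 2, 3],
                 [5, 5, 5, 4],
                 [2, 2, 3, 2],
                 [4, 4, 4, 5]])
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")  # high alpha -> items hang together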
When one considers the statistics that measure relationships between variables, it can be argued that regression and correlation are the ones that can gauge the similarity of assessment instruments: if one argues that a test is reliable, the cloud of points examined would approximate a straight line.
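A minimal sketch of that idea follows, using hypothetical scores from two parallel forms of an assessment: if the forms measure consistently, the paired scores cluster around a straight line, and both the Pearson correlation and the fitted regression slope reflect that.

import numpy as np

# Hypothetical percentage scores for the same ten examinees on two parallel forms.
form_a = np.array([62, 71, 55, 88, 93, 47, 76, 69, 81, 58], dtype=float)
form_b = np.array([65, 68, 58, 85, 90, 51, 74, 72, 79, 61], dtype=float)

r = np.corrcoef(form_a, form_b)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(form_a, form_b, 1)  # least-squares regression line

print(f"correlation r = {r:.2f}")                 # close to 1.0 -> points near a line
print(f"regression:   form_b ~= {slope:.2f} * form_a + {intercept:.2f}")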

Since the argument concerns the use of inferential statistics (e.g., t-tests, ANOVA, etc.) to analyze evaluation results, the implication is that statistics such as power analysis, statistical significance, and effect size can make a difference if not attended to. The inferential category includes tests comparing two hypotheses. This is what one does in the familiar situation where one conducts a hypothesis test using a null hypothesis, i.e., the assertion that the program, curriculum, or experiment had no effect, versus the alternative hypothesis, i.e., the assertion that the curriculum did have an effect. Finding a difference that does not exist is called a Type I error; its opposite, failing to detect an effect that the curriculum actually had, is a Type II error. Power refers to the probability that your test will find a statistically significant difference when such a difference actually exists (Wolske & Higgs, 2007). It is the probability that you will reject the null hypothesis when you should, and it is generally accepted that power should be .8 or greater; that is, you should have an 80% or greater chance of finding a statistically significant difference when there is one. The argument is that this is similar to reliability: as the sample size increases, so does the power of your test, which is to say that a larger sample means you have collected more information, making it easier to correctly reject the null hypothesis when you should. Power can be computed from a regression analysis; the alpha level is important in its calculation, and if the calculated power is less than .8, the sample size should be increased (Wolske & Higgs, 2007).

To examine the likelihood that the changes you observe in your participants' knowledge, attitudes, and behaviors are due to chance rather than to the program, one tests for statistical significance to determine how likely it is that these changes occurred randomly and do not represent differences due to the program (Wolske & Higgs, 2007). This aspect of reliability, if you will, is derived from a comparison of the probability value you get from your test (the p-value) to the critical probability value you determined ahead of time (the alpha level) (Wolske & Higgs, 2007). The actual test is this: if the p-value is less than the alpha value, you can conclude that the difference you observed is statistically significant. The Type I error rate is the chance you are willing to accept that your results are due to chance rather than to your program; so an alpha of .05 means you accept a 5% chance that your results are due to chance rather than to your program (Wolske & Higgs, 2007). The p-value is the probability that the results were due to chance and not to your program, and p-values range from 0 to 1 (Wolske & Higgs, 2007). It follows that the lower the p-value, the more likely it is that a difference occurred as a result of your program (Wolske & Higgs, 2007).
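The sketch below ties these pieces together on invented data: a two-sample t-test compares scores under two curriculum delivery methods, the resulting p-value is compared with an alpha of .05, and a power calculation estimates the per-group sample size needed to reach the conventional .8 power for an assumed medium effect (Cohen's d = 0.5). The scores and group sizes are hypothetical; scipy and statsmodels supply the test and the power routine.

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

alpha = 0.05  # willingness to accept a 5% chance of a Type I error

# Hypothetical exam scores under the old and new curriculum delivery methods.
old_method = np.array([68, 72, 65, 70, 74, 66, 71, 69, 67, 73], dtype=float)
new_method = np.array([74, 78, 71, 80, 76, 75, 79, 72, 77, 81], dtype=float)

# Null hypothesis: the delivery method had no effect on mean scores.
t_stat, p_value = stats.ttest_ind(new_method, old_method)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("p < alpha: the observed difference is statistically significant")
else:
    print("p >= alpha: fail to reject the null hypothesis")

# Power analysis: sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at alpha = .05.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=alpha, power=0.8)
print(f"required n per group for power .8: {np.ceil(n_per_group):.0f}")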
When one discovers a difference that is statistically significant, it does not necessarily mean that it is big, important, or helpful in decision-making. For an instructor, however, it provides the basis for the argument that the assessment instrument evaluated a curriculum delivery method whose effect was significant. Wolske and Higgs (2007) state that to know whether an observed difference is not only statistically significant but also important or meaningful, you will need to calculate its effect size, using a standardized calculation; because all effect sizes are calculated on a common scale (Wolske & Higgs, 2007), this allows you to compare the effectiveness of different programs on the same outcome.
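One common standardized calculation is Cohen's d, the difference between group means divided by the pooled standard deviation. The sketch below applies it to the same kind of hypothetical two-group score data; it illustrates the general idea rather than the specific procedure Wolske and Higgs describe.

import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
                  (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# Hypothetical post-test scores for a new versus old delivery method.
new_method = np.array([74, 78, 71, 80, 76, 75, 79, 72, 77, 81], dtype=float)
old_method = np.array([68, 72, 65, 70, 74, 66, 71, 69, 67, 73], dtype=float)

d = cohens_d(new_method, old_method)
print(f"Cohen's d = {d:.2f}")  # values above ~0.8 are conventionally 'large'; Drissell's 1.0 ~ one grade level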

References

DSS Research, Inc. (2000). Mail versus Internet survey. DSS Research, Inc., pp. 1-10.

640-822 ICND1, Interconnecting Cisco Networking Devices Part 1, IT Certification and Career Paths. Retrieved 11/13/2011 from http://www.cisco.com/web/learning/le3/current_exams/640-822.htm

Wolske, K., & Higgs, A. (2007). Power analysis, statistical significance, & effect size. Retrieved 11/13/2011 from http://meera.snre.umich.edu/plan-an-evaluation/plonearticlemultipage.2007-10-30.3630902539/power-analysis-statistical-significance-effect-size

Miller, M. J. (n.d.). Reliability and validity. Western International University, RES 600: Graduate Research Methods. Retrieved 11/13/2011 from http://www.michaeljmillerphd.com/res500_lecturenotes/reliability_and_validity.pdf
