
MEASUREMENT IN PHYSICAL EDUCATION AND EXERCISE SCIENCE, 10(4), 241–254. Copyright © 2006, Lawrence Erlbaum Associates, Inc.

Equivalence Reliability Among the FITNESSGRAM Upper-Body Tests of Muscular Strength and Endurance
Todd Sherman
Division of Physical Education and Dance Oxford College of Emory University

J. P. Barfield
Department of Exercise Science, Physical Education, & Wellness Tennessee Tech University

This study was designed to investigate the equivalence reliability between the suggested FITNESSGRAM muscular strength and endurance test, the 90° push-up (PSU), and alternate FITNESSGRAM tests of upper-body strength and endurance (i.e., modified pull-up [MPU], flexed-arm hang [FAH], and pull-up [PU]). Children (N = 383) in Grades 3 to 6 were tested over a period of a week. Equivalence reliability for the PSU–MPU comparison was acceptable for boys across ages 8 to 10 (Percentage Agreement [Pa] = .74 to .78, Modified Kappa [Kq] = .48 to .56), with good agreement among boys age 11 (Pa = .86, Kq = .72). Equivalence reliability for girls was unacceptable across all ages (Pa = .54 to .59, Kq = .04 to .18). Reliability estimates were also acceptable for boys on the PSU–FAH comparison across all ages (Pa = .72 to .80, Kq = .44 to .60). Similar results were not found for girls. Consistency of classification was not demonstrated between PSU–PU for boys ages 10 and 11 years; however, acceptable to good estimates were found for the girls' PSU–PU comparison (Pa = .72 to .82, Kq = .44 to .64) with the exception of girls age 11 (Pa = .67, Kq = .34). Practitioners must recognize that using alternative FITNESSGRAM strength and endurance items may result in a different healthy fitness zone classification for children than the recommended test, the PSU.

Key words: criterion-reference, children's strength tests

Correspondence should be sent to Todd Sherman, Oxford College of Emory University, Division of Physical Education and Dance, 100 Hamill Street, Oxford, GA 30054. E-mail: tsherma@learnlink.emory.edu


SHERMAN AND BARFIELD

Adequate upper-body strength is necessary for performing functional and daily activities as well as preventing injury and osteoporosis (Kollath, Safrit, Zhu, & Gao, 1991; Pate, Burgess, Woods, Ross, & Baumgartner, 1993; Ross & Pate, 1987). In addition, physical educators use muscular fitness test scores to document health-related physical fitness and estimate levels that may yield benefits into adulthood (Cooper Institute for Aerobics Research [CIAR], 1999; Cureton & Warren, 1990). Because of the practicality and the importance of muscular strength and endurance testing, test developers make consistent efforts to include upper-body strength measures in test batteries (Engelman & Morrow, 1991).

The FITNESSGRAM health-related physical fitness test battery was developed by the CIAR (1999) and is currently endorsed by the American Alliance for Health, Physical Education, Recreation and Dance. Unique to the FITNESSGRAM, practitioners have the option of using any one of the following FITNESSGRAM field tests to measure upper-body strength and endurance: (a) the traditional pull-up (PU), (b) the modified pull-up (MPU), (c) the 90° push-up (PSU), and (d) the flexed-arm hang (FAH). Although the practitioner may choose to use any of the tests, the PSU is recommended.

FITNESSGRAM scores are evaluated against both norm-referenced and criterion-referenced standards. Criterion-referenced standards were established in the late 1970s and early 1980s to help indicate levels of physical fitness needed for good health (Cureton & Warren, 1990). The unique feature of the FITNESSGRAM is that it allows the practitioner the option of administering any of the four FITNESSGRAM upper-body strength tests; hence, a child should receive the same criterion classification (i.e., healthy or unhealthy fitness zone) regardless of the test administered. If tests are used interchangeably, tests must be equivalent (Zhu, 1998).
There have been numerous studies on norm-referenced reliability and validity evidence for field tests of upper-body strength and endurance (Cotten, 1990; DiNucci, McCune, & Shows, 1990; Engelman & Morrow, 1991; Jackson, Fromme, Plitt, & Mercer, 1994; Kollath et al., 1991; McManis, Baumgartner, & Wuest, 2000; Pate et al., 1993; Rutherford & Corbin, 1994); unfortunately, there is limited evidence supporting the consistency of classification across tests (Looney & Plowman, 1990). To this point, Romain and Mahar (2001) have published the only study addressing criterion-referenced equivalence reliability of FITNESSGRAM upper-body strength and endurance items. These researchers evaluated consistency of classification between the PSU and MPU among children but limited the study to Grades 5 and 6 and excluded additional FITNESSGRAM options (i.e., PU, FAH). If tests are not consistent in classification, problems can occur when using test scores to classify whether children are in a healthy fitness zone. Misclassification of a child may lead to an overestimation of appropriate physical activity or to discouragement from participation because the child feels the standard is unachievable. Both outcomes may affect the child's development of an active lifestyle that is conducive to his or her health-related fitness (Cureton & Warren, 1990). Therefore, this study was designed to determine the consistency of classification, or equivalence reliability, between the PSU, the FITNESSGRAM's suggested upper-body muscular strength and endurance test, and other upper-body strength and endurance test options across elementary school ages.

METHODS

Participants

The participants were a convenience sample of 403 children from one elementary school in a metropolitan area. The children were between 7 and 13 years of age and were enrolled in Grades 3 to 6 physical education classes. Scores collected from children ages 7, 12, and 13 were not included in the analyses because of low sample size. Thus, the total number in the sample was 383 (boys n = 201, girls n = 182). Permission from the principal, the director of schools, and the Institutional Review Board (IRB) was obtained prior to testing. Because fitness testing was a part of the children's physical education curriculum, the IRB committee only required permission from the parents prior to data collection.

Instrument

The FITNESSGRAM (CIAR, 1999) upper-body strength tests were administered to all participants. The tests included the MPU, the PU, the FAH, and the PSU. Test administration procedures were strictly followed as detailed in the FITNESSGRAM test manual (CIAR, 1999, pp. 25–28).

Procedures

Eight test administrators, consisting of graduate students and faculty members, were utilized for this study. All administrators had prior testing experience; however, the principal investigator required each test administrator to review FITNESSGRAM test administration protocol and to participate in the practice trials 1 week prior to data collection. These practice trials allowed for the identification and remedy of any procedural or scoring problems by the principal investigator and test administrators. Fitness testing was part of the physical education curriculum and students were familiar with test items. Students were given additional instruction on correct performance on all tests and practice time on all test items prior to data collection.

Testing was conducted during the 30-min physical education classes over 2 days. Classes met on a Monday–Wednesday or Tuesday–Thursday schedule. On Day 1, MPU and FAH were administered. As the children entered the gymnasium, students were divided into three groups of eight. Each group started either at the MPU, FAH, or the height and weight station. Student groups rotated at 9-min intervals. PU, PSU, and an activity station (i.e., jumping rope) were administered on Day 2. Each station group of students had approximately 2 to 3 min of rest collectively before performing at the next station. Because students remained in alphabetical order throughout testing, students rested, individually, approximately 8 min before performing the next test. The principal investigator returned the following week to collect five make-up test scores.

Analyses

Students were categorized as being in a healthy or unhealthy fitness zone based on the criterion-referenced standard for their age and gender (CIAR, 1999). Percentage Agreement (Pa) was used to determine equivalence reliability for the following comparisons: PSU–MPU, PSU–FAH, and PSU–PU. Equivalence reliability, a term more appropriate to the psychomotor domain, is sometimes called alternate forms reliability; the Pa reflects the extent to which two tests result in the same classification (Morrow, Jackson, Disch, & Mood, 2000). Looney (1989) indicated that reliability studies should include both Pa and Modified Kappa (Kq) when the proportion of masters is not fixed. Whereas Pa represents the proportion of individuals receiving the same fitness zone classification on two tests and is influenced by chance agreement, Kq represents the proportion of individuals receiving the same fitness zone classification after controlling for chance agreement. The following assumptions for Kq were met: (a) independence among objects to be categorized, and (b) independence and exclusivity of categories. All agreement statistics were calculated using SPSS for Windows (version 10.0) and a statistical Web site (Chuang, 2001).
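The two agreement statistics described above can be illustrated with a short Python sketch. It assumes the common definition of modified kappa, Kq = (Pa - 1/q) / (1 - 1/q), where q is the number of classification categories (q = 2 for healthy vs. unhealthy fitness zone); with q = 2 this reduces to Kq = 2Pa - 1, which matches the per-age Pa and Kq pairs reported in this article. The class name, function names, and cell counts below are hypothetical and are not taken from the study's contingency tables.

```python
from dataclasses import dataclass


@dataclass
class AgreementTable:
    """2x2 cross-classification of pass/fail decisions on two tests."""
    pass_both: int     # in the healthy fitness zone on test A and test B
    pass_a_only: int   # passed A, failed B
    pass_b_only: int   # failed A, passed B
    fail_both: int     # failed both tests

    @property
    def total(self) -> int:
        return (self.pass_both + self.pass_a_only
                + self.pass_b_only + self.fail_both)


def proportion_agreement(t: AgreementTable) -> float:
    """Pa: share of children given the same classification by both tests."""
    return (t.pass_both + t.fail_both) / t.total


def modified_kappa(pa: float, q: int = 2) -> float:
    """Kq: Pa corrected for the agreement expected by chance (1/q)."""
    chance = 1.0 / q
    return (pa - chance) / (1.0 - chance)


# Hypothetical counts for 50 children (illustration only).
table = AgreementTable(pass_both=30, pass_a_only=3, pass_b_only=8, fail_both=9)
pa = proportion_agreement(table)
kq = modified_kappa(pa)
print(f"Pa = {pa:.2f}, Kq = {kq:.2f}")
```

Keeping the two off-diagonal cells separate also shows the direction of any disagreement, for example how many children pass one test but fail the other.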

RESULTS

Descriptive statistics for boys (n = 201) and girls (n = 182) are presented in Table 1. The weight and height of the boys and girls increased with each successive age group. This sample was above the national average for both height and weight (Ogden et al., 2002). Mean Body Mass Index (BMI) for boys and girls met the criteria for the healthy fitness zone as stated in the FITNESSGRAM test manual (CIAR, 1999, pp. 40–41), with one exception: mean BMI for 11-year-old boys exceeded the healthy fitness zone criterion. Because some test items yielded many zero scores, Tables 2 and 3 include the number of zero scores and the 25th, 50th, and 75th percentiles for all the upper-body strength and endurance tests.

TABLE 1 Means and Standard Deviations for Eight to Eleven Year-Old Boys and Girls Height, Weight and BMI Age 8 Variables Boys Weight Height BMI Girls Weight Height BMI M SD n 46 33.0 132.6 18.6 33.8 132.1 19.2 7.7 5.1 3.4 39 10.9 5.6 5.0 36.6 136.2 19.6 9.2 6.6 3.9 37.0 136.7 19.6 10.4 6.9 4.2 56 42.1 142.5 20.5 12.9 6.6 4.8 M Age 9 SD n 50 40.9 141.5 20.1 11.8 7.9 4.2 44 47.9 148.1 21.4 16.4 8.4 5.3 M Age 10 SD n 61 50.6 150.1 22.0 16.7 8.9 5.2 43 M Age 11 SD n 44

Note. BMI = body mass index. Height is reported in centimeters, weight is reported in kilograms, and BMI is reported as weight in kilograms divided by height in squared meters.

TABLE 2
Boys' 25th, 50th, and 75th Percentile Scores for FITNESSGRAM's Tests of Upper-Body Strength and Endurance

Age            Test Item   Number of Zero Scores   25th Percentile   50th Percentile   75th Percentile
8 (n = 46)     PSU          0                      4.0                7.0              12.0
               MPU          0                      6.0               10.5              15.3
               PU          28                      0.0                0.0               1.3
               FAH         10                      1.0                5.0              12.3
9 (n = 50)     PSU          2                      4.0                8.0              13.0
               MPU          2                      7.0               11.0              15.3
               PU          31                      0.0                0.0               2.0
               FAH          6                      3.0                5.0              13.5
10 (n = 61)    PSU          3                      3.5                7.0              14.5
               MPU          1                      4.5               10.0              15.0
               PU          39                      0.0                0.0               1.0
               FAH         11                      1.0                5.0              12.0
11 (n = 44)    PSU          2                      5.0                9.0              17.0
               MPU          2                      5.0                9.0              14.8
               PU          29                      0.0                0.0               1.0
               FAH         17                      0.0                3.5               9.75

Note. MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; and FAH = flexed-arm hang in seconds. Age is represented in years.


TABLE 3
Girls' 25th, 50th, and 75th Percentile Statistics for FITNESSGRAM's Tests of Upper-Body Strength and Endurance

Age            Test Item   Number of Zero Scores   25th Percentile   50th Percentile   75th Percentile
8 (n = 39)     PSU          3                      2.0                4.0               8.0
               MPU          2                      6.0                9.0              14.0
               PU          27                      0.0                0.0               1.0
               FAH         13                      0.0                4.0               8.0
9 (n = 56)     PSU          2                      2.0                4.0               8.0
               MPU          0                      5.0                8.0              11.8
               PU          44                      0.0                0.0               0.0
               FAH         17                      0.0                4.0               6.8
10 (n = 44)    PSU          6                      2.0                5.0               8.5
               MPU          1                      3.3                8.0              14.5
               PU          35                      0.0                0.0               0.0
               FAH         14                      0.0                3.0               8.0
11 (n = 43)    PSU          0                      2.0                6.0              12.0
               MPU          1                      5.0               10.0              13.0
               PU          34                      0.0                0.0               0.0
               FAH          1                      1.0                3.0               6.0

Note. MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; FAH = flexed-arm hang in seconds. Age is represented in years.

The 90° PSU test is recommended by the FITNESSGRAM; therefore, Pa indexes were computed between the PSU and all other tests of upper-body strength. Pa indexes between the PSU and the other tests of upper-body strength for each age and gender are reported in Tables 4 and 5. Kq was used to correct for chance agreement. When evaluating equivalence reliability, it is important to note that Pa estimates for small sample sizes (N = 30) are unbiased relative to large sample values; however, less evidence is available regarding Kq (Looney, 1989). Therefore, Kq estimates were also calculated for children ages 8 and 9 collectively and ages 10 and 11 collectively to increase sample size and improve the ability to generalize findings (Tables 4 and 5). Pa and Kq were evaluated by separate criteria. Pa values between .50 and 1.0 are acceptable, but values should be closer to 1.0 than to .50 to establish equivalence reliability (Baumgartner, Jackson, Mahar, & Rowe, 2003; Looney, 1989). Kq values greater than .75 are excellent, values between .60 and .75 are good, and values between .40 and .60 are acceptable estimates of equivalence reliability (Morrow et al., 2000). Student passing rates on test items are included in Tables 6 and 7.
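The Kq evaluation criteria above can be expressed as a small lookup, sketched here in Python. The function name is hypothetical, the boundary handling (a value of exactly .60 counted as good) follows the inequalities printed in the notes to Tables 4 and 5, and the label "unacceptable" for values below .40 is an assumption based on how such values are described in the text.

```python
def describe_kq(kq: float) -> str:
    """Map a modified kappa value to the descriptive label used in this article."""
    if kq > 0.75:
        return "excellent"
    if kq >= 0.60:
        return "good"
    if kq >= 0.40:
        return "acceptable"
    # Assumed label: the article treats values below .40 as unacceptable.
    return "unacceptable"


# Boys' PSU-MPU modified kappa values by age, as reported in Table 4.
boys_psu_mpu = {8: 0.56, 9: 0.56, 10: 0.48, 11: 0.72}
for age, kq in boys_psu_mpu.items():
    print(f"age {age}: Kq = {kq:.2f} ({describe_kq(kq)})")
```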

TABLE 4
Boys' Percent Agreement and Modified Kappa Indexes Between the Push-Up Test and the FITNESSGRAM's Alternate Tests of Upper-Body Strength and Endurance

Age                    Statistic   PSU–MPU   PSU–FAH   PSU–PU
8 (n = 46)             Pa          .78       .80       .61
                       Kq          .56       .60       .22
9 (n = 50)             Pa          .78       .76       .62
                       Kq          .56       .52       .24
10 (n = 61)            Pa          .74       .72       .74
                       Kq          .48       .44       .48
11 (n = 44)            Pa          .86       .75       .70
                       Kq          .72       .50       .40
8 and 9 (n = 96)       Kq          .39       .49       .29
10 and 11 (n = 105)    Kq          .55       .46       .47

Note. Pa = percent agreement; Kq = modified kappa; MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; and FAH = flexed-arm hang. Age is represented in years. Excellent = Kq > .75, Good = .60 ≤ Kq ≤ .75, and Acceptable = .40 ≤ Kq < .60.

TABLE 5
Girls' Percent Agreement Indexes Between the Push-Up Test and the FITNESSGRAM's Alternate Tests of Upper-Body Strength and Endurance

Age                    Statistic   PSU–MPU   PSU–FAH   PSU–PU
8 (n = 39)             Pa          .54       .56       .72
                       Kq          .08       .12       .44
9 (n = 56)             Pa          .48       .64       .77
                       Kq          .04       .28       .54
10 (n = 44)            Pa          .59       .75       .82
                       Kq          .18       .50       .64
11 (n = 43)            Pa          .58       .67       .67
                       Kq          .16       .34       .34
8 and 9 (n = 95)       Kq          .11       .24       .46
10 and 11 (n = 87)     Kq          .25       .40       .44

Note. Pa = percentage agreement; Kq = modified kappa; MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; and FAH = flexed-arm hang. Excellent = Kq > .75, Good = .60 ≤ Kq ≤ .75, and Acceptable = .40 ≤ Kq < .60. Age is represented in years.


TABLE 6
Passing Rates (%) on Upper-Body Strength and Endurance Items for Boys

Age   PSU   MPU   FAH   PU
8     70    87    63    39
9     68    90    76    38
10    53    75    61    36
11    56    68    39    34

Note. MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; FAH = flexed-arm hang. Age is represented in years.

TABLE 7
Passing Rates (%) on Upper-Body Strength and Endurance Items for Girls

Age   PSU   MPU   FAH   PU
8     46    92    54    28
9     41    89    59    21
10    34    75    45    21
11    49    86    26    21

Note. MPU = modified pull-up; PSU = 90° push-up; PU = pull-up; FAH = flexed-arm hang. Age is represented in years.

Boys

Using these criteria, equivalence reliability for the PSU–MPU comparison was acceptable for boys ages 8 to 10 (Pa = .74 to .78, Kq = .48 to .56), with good agreement for boys age 11 (Pa = .86; Kq = .72). Reliability estimates were also acceptable for the PSU–FAH comparison across all ages (Pa = .72 to .80, Kq = .44 to .60); however, consistency of classification was not demonstrated for the PSU–PU comparison. Estimates were unacceptable for boys ages 8 and 9 (Pa = .61 to .62; Kq = .22 to .24) and barely acceptable for boys ages 10 and 11 (Kq = .40 to .48).

Girls

Equivalence reliability for the PSU–MPU comparison was unacceptable across all ages (Pa = .48 to .59; Kq = .04 to .18). Similar results were found for the PSU–FAH comparison (Table 5) except for 10-year-olds. Acceptable to good estimates were found for the PSU–PU comparison (Pa = .72 to .82, Kq = .44 to .64) with the exception of girls age 11 (Kq = .34).


DISCUSSION

The study was designed to determine the equivalence reliability, or alternate forms reliability, between the suggested FITNESSGRAM upper-body strength test, the PSU, and alternative test choices. If high equivalence reliability exists among test items, practitioners may use test items interchangeably and feel confident that a child's fitness zone classification will be consistent across tests. If low reliability exists, FITNESSGRAM researchers may need to adjust healthy fitness zone criteria or reevaluate optional test items.

Boys

PSU–MPU. In this study, equivalence reliability estimates were acceptable for the PSU–MPU comparison for boys; however, these estimates were insufficient to conclude that tests can be used interchangeably. Looney (1989) indicated that a high percentage of masters and large sample variability increase classification agreement as described by the proportion of agreement. Although the PSU and MPU data had these characteristics, far too many boys were misclassified based on test choice. As reflected in the contingency tables, approximately 20% of boys were classified differently between tests at each age level, and the majority of misclassified boys passed the MPU standard but failed the PSU. When the influence of chance is removed, only 48% to 72% of boys, depending on age, would be expected to receive the same classification on both tests. Acceptable equivalency was noted but classification agreement can be improved (Table 4). Based on these findings, one cannot conclude that these tests are truly equivalent in terms of healthy fitness zone classification.

PSU–FAH. Reliability estimates for the PSU–FAH comparison were also statistically acceptable but insufficient to conclude that evaluation standards are equal. Contingency tables revealed that 25% of boys at each age level were classified differently between tests and that misclassification occurred in both directions. Similar to the Kq estimates for the PSU–MPU comparison, equivalency reliability was acceptable but can be improved (Kq = .44 to .60). Also, more zero scores were recorded for the FAH than the PSU or the MPU (Table 2). Therefore, practitioners may want to consider the usefulness of the FAH, in addition to classification consistency, because many boys will not be able to complete one attempt.

PSU–PU. Equivalence reliability estimates were unacceptable for the PSU–PU comparison (Table 4). The high percentage of zero scores on the PU (63% of all boys) contributed to low passing rates (Table 6), despite the minimal criteria (i.e., 1 PU). Poor performance on the PU has been documented elsewhere (Engelman & Morrow, 1991; Ross, Dotson, Gilbert, & Katz, 1985). We recommend that the PU be removed from future editions of the FITNESSGRAM due to test difficulty.

Minimal criterion-referenced reliability evidence for FITNESSGRAM upper-body strength and endurance items is present in the literature (Kollath et al., 1991; Romain & Mahar, 2001), especially in terms of equivalence reliability. Romain and Mahar (2001) published the initial equivalence reliability study specific to FITNESSGRAM upper-body test items. These researchers compared the classification consistency between the PSU and MPU in boys (n = 30) and girls (n = 32) in Grades 5 and 6. Among boys, Romain and Mahar reported Pa = .70 and Kq = .40 and suggested that these reliability estimates were not acceptable. Norm-referenced PSU and MPU mean scores were similar to those reported in this study, but our reliability estimates were slightly higher and extended the test comparisons across all alternative FITNESSGRAM items. Nonetheless, we agree with Romain and Mahar that classification consistency must improve before test items are used interchangeably.

Girls

PSU–MPU. Equivalence reliability estimates were unacceptable for the PSU–MPU comparison across all ages (Table 5). Percentage agreement ranged from .48 to .59. Based on contingency tables, over 40% of girls in each group were classified differently between tests, and the majority of misclassified girls passed the MPU but failed the PSU. If the influence of chance is removed, the equivalency reliability is unacceptable (Kq ≤ .20). Practitioners should not be encouraged to use the PSU and MPU interchangeably to evaluate criterion-referenced muscular strength and endurance performance for girls.

PSU–FAH. Reliability estimates for the PSU–FAH comparisons were also unacceptable across ages, with the exception of age 10 (Table 5). Pa statistics ranged from .56 to .64, and contingency tables revealed that 36% to 44% of girls ages 8, 9, and 11 were classified differently between tests with the majority passing the FAH and failing the PSU. If chance agreement is controlled, equivalency reliability is unacceptable. Additionally, an unacceptable number of girls received a zero score on the FAH (45 of 182). Similar to the recommendation for the PSU–MPU comparison, practitioners should not use the PSU and FAH interchangeably to assess upper-body strength.

PSU–PU. In contrast to boys, the highest classification agreement for girls was recorded for the PSU–PU comparison. Sixty-seven percent to 82% of girls received the same classification on both test items. Although Pa estimates were acceptable with the exception of age 11 (Table 5), Kq estimates indicated that less than 55% of girls would be classified the same if chance agreement were controlled. More importantly, the PU test was too difficult to complete. The majority of girls could not complete 1 PU (140 of 182 failed the test). In this case, acceptable equivalence reliability was due to test difficulty on both items and should not indicate that both items are appropriate evaluations of upper-body muscular strength. The authors recommend that the PU be eliminated as a test choice from the FITNESSGRAM. Furthermore, it appears evident that healthy fitness zone criteria for the PSU are too high for girls and should be lowered to increase classification consistency with MPU and FAH tests.

The investigation by Romain and Mahar (2001) documented lower mean PSU and MPU scores than this sample and also documented unacceptable reliability (Pa = .69, Kq = .38) between the PSU and MPU. These researchers concluded that criterion consistency among strength items needed to be addressed. If the PSU remains the recommended FITNESSGRAM muscular strength and endurance test item for girls, classification consistency with optional test items must be improved.

Application

Equivalence reliability between the recommended FITNESSGRAM upper-body strength test, the PSU, and alternative muscular strength and endurance items needs improvement. This finding is consistent with the lack of agreement between the PSU and MPU documented by Romain and Mahar (2001). Although some reliability estimates were statistically acceptable across specific comparisons (Pa), most comparisons for both boys and girls were unacceptable for practical usage when chance agreement was considered (Kq). Consistency of classification across future recommended tests and criteria must improve if practitioners use FITNESSGRAM battery data to assess health-related fitness (i.e., muscular fitness).

Validity. Equivalence reliability addresses one measurement consideration of FITNESSGRAM muscular strength and endurance test items. Specifically, this study addresses the appropriateness of using test items interchangeably to classify a child's score into the healthy or unhealthy fitness zone; however, criterion-referenced equivalence reliability estimates, whether high or low, should not dictate the inclusion or exclusion of muscular strength and endurance items within the battery. The validity of criterion-referenced standards is an additional, but interrelated, measurement issue. Although the authors of this study make specific recommendations for test item alterations, these recommendations must be considered within a larger theoretical framework relative to FITNESSGRAM test items and evaluation criteria.


Appropriate criterion-referenced standards are difficult to establish (Morrow et al., 2000). It is disturbing, but not surprising, that equivalence reliability across FITNESSGRAM items is inadequate. One explanation is that upper-body strength and endurance items vary in difficulty (Zhu, 1998), which makes it difficult to set equivalent standards. Lack of agreement could also be due to the varying muscle groups that each test emphasizes (Engelman & Morrow, 1991; Pate et al., 1993). For example, the PSU emphasizes the pectoralis major and triceps whereas the PU emphasizes the latissimus dorsi and biceps. Engelman and Morrow (1991) documented a moderate relation between norm-referenced MPU and PU scores. These researchers reported correlation coefficients of .49 to .71 among boys and girls in Grades 3 to 5. Pate and colleagues (1993) reported correlations of .40 to .71 between the PSU and MPU. Although not specific to criterion-referenced standards, inadequate correlations among test items reinforce that various upper-body tests measure varying components of strength.

To this point, no empirical efforts have been made to validate criterion-referenced standards of strength and endurance test items. Looney and Plowman (1990) have reported one method to establish valid criterion standards. These researchers suggested that two groups of students, one active and one inactive, be tested on a specific item. The active group utilizes muscular strength and endurance for everyday use and possesses a suitable level of function whereas the inactive group possesses a distinctly lower level of function. The intersection of sample distributions is therefore selected as the appropriate standard. This research method would be useful in the future study and evaluation of FITNESSGRAM test items.

Equating tests is an additional method that can be used to address the validity of criterion-referenced standards. Although a variety of statistical options are available, Zhu (1998) indicated that traditional equating is an appropriate method for two muscular strength tests (i.e., use of z scores to equate tests). Using this method, one can determine a score for an alternative test that corresponds to a specific score on the gold standard. For example, a PSU score of 7 may correspond to an MPU score of 10. FITNESSGRAM healthy fitness zone criteria should parallel equated scores. Although this sample size is not sufficient to draw inferences, equated scores between the PSU and other test items for 10-year-old boys in this sample did not reflect the same health zone classifications. Although the purpose of this study is to address classification consistency across tests, one must consider the validity of each test when drawing conclusions from reliability estimates.
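As a rough illustration of z-score equating, the Python sketch below converts a score on one test to the score on a second test that has the same z score. It assumes simple linear equating on sample means and standard deviations, which may differ in detail from the procedure described by Zhu (1998). The function name and the score lists are hypothetical and are not the study's data.

```python
from statistics import mean, stdev


def linear_equate(score_x: float, scores_x: list[float], scores_y: list[float]) -> float:
    """Return the test-Y score whose z score equals that of score_x on test X."""
    z = (score_x - mean(scores_x)) / stdev(scores_x)
    return mean(scores_y) + z * stdev(scores_y)


# Hypothetical repetition counts for the same five children on two tests.
psu_scores = [2, 4, 6, 8, 10]    # test X, e.g. a push-up test
mpu_scores = [5, 8, 11, 14, 17]  # test Y, e.g. a modified pull-up test

# Equated test-Y score for a test-X score of 8.
equated = linear_equate(8, psu_scores, mpu_scores)
print(round(equated, 1))
```

If the healthy fitness zone cutoff on the second test differs noticeably from the equated value of the first test's cutoff, the two standards are unlikely to classify children consistently.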

Reliability. High criterion-referenced test–retest reliability coefficients have been documented for both the PSU (Pa = .97, Kq = .94) and MPU (Pa = .95, Kq = .90; Romain & Mahar, 2001) but have not been documented for either the PU or FAH. A major limitation of this study is that criterion-referenced reliability was not established for all test items either prior to or during testing. Indeed, equivalence reliability among test scores is influenced by test–retest consistency. Although norm-referenced test–retest reliability coefficients reported for the PU and FAH have been high (Cotten, 1990; Engelman & Morrow, 1991; Pate et al., 1993), it is difficult to conclude if classification inconsistency in this study resulted from error between tests, within participants, or both.

The FITNESSGRAM is currently in its third edition. Results from this study suggest that upper-body muscular strength criterion standards should be investigated further for boys and girls in Grades 3 to 6 relative to equivalence reliability. The FITNESSGRAM standards have been adjusted over time to reflect appropriate standards, and further study relative to equivalence reliability will enhance usefulness and application of the test battery.

REFERENCES
Baumgartner, T. A., Jackson, A. S., Mahar, M. T., & Rowe, D. A. (2003). Measurement for evaluation in physical education and exercise science (7th ed.). Boston: McGraw-Hill.
Chuang, J. H. (2001). Agreement between categorical measurements: Kappa statistics. Retrieved May 5, 2001, from Columbia University, Department of Medical Informatics Web site: http://www.dmi.columbia.edu/homePages/chuangj/kappa/
Cooper Institute for Aerobics Research. (1999). The Prudential FITNESSGRAM test administration manual. Dallas, TX: Author.
Cotten, D. J. (1990). An analysis of the NCYFS II modified pull-up test. Research Quarterly for Exercise and Sport, 61, 272–274.
Cureton, K. J., & Warren, G. L. (1990). Criterion-referenced standards for youth health-related tests: A tutorial. Research Quarterly for Exercise and Sport, 61, 7–19.
DiNucci, J., McCune, D., & Shows, D. (1990). Reliability of a modification of the health-related physical fitness test for use with physical education majors. Research Quarterly for Exercise and Sport, 61, 20–25.
Engelman, M. E., & Morrow, J. R., Jr. (1991). Reliability and skinfold correlates for traditional and modified pull-ups in children grades 3–5. Research Quarterly for Exercise and Sport, 62, 88–91.
Jackson, A. W., Fromme, C., Plitt, H., & Mercer, J. (1994). Reliability and validity of a 1-minute 90° push-up test for young adults [Abstract]. Research Quarterly for Exercise and Sport, 65(Suppl. 1), A57–A58.
Kollath, J., Safrit, J., Zhu, W., & Gao, L. (1991). Measurement errors in modified pull-ups testing. Research Quarterly for Exercise and Sport, 62, 432–435.
Looney, M. (1989). Criterion-referenced measurement: Reliability. In M. Safrit & T. Wood (Eds.), Measurement concepts in physical education and exercise science (pp. 137–152). St. Louis, MO: Mosby.
Looney, M. A., & Plowman, S. A. (1990). Passing rates of American children and youth on the FITNESSGRAM criterion-referenced physical fitness standards. Research Quarterly for Exercise and Sport, 61, 215–223.
McManis, B. G., Baumgartner, T. A., & Wuest, D. A. (2000). Objectivity and reliability of the 90° push-up test. Measurement in Physical Education and Exercise Science, 4, 57–67.
Morrow, J. R., Jackson, A. W., Disch, J. G., & Mood, D. P. (2000). Measurement and evaluation in human performance (2nd ed.). Champaign, IL: Human Kinetics.
Ogden, C. L., Kuczmarski, R. J., Flegal, K. M., Mei, Z., Guo, S., Wei, R., et al. (2002). Centers for Disease Control and Prevention 2000 growth charts for the United States: Improvements to the 1977 National Center for Health Statistics version. Pediatrics, 109, 45–60.
Pate, R., Burgess, M., Woods, J., Ross, J., & Baumgartner, T. (1993). Validity of field tests of upper body muscular strength. Research Quarterly for Exercise and Sport, 64, 17–24.
Romain, B. S., & Mahar, M. T. (2001). Norm-referenced and criterion-referenced reliability of the 90° push-up and modified pull-up. Measurement in Physical Education and Exercise Science, 5, 67–80.
Ross, J., Dotson, C., Gilbert, G., & Katz, S. (1985). New standards for fitness measurement. Journal of Physical Education, Recreation, and Dance, 56(1), 66–70.
Ross, J. G., & Pate, R. R. (1987). The national children and youth fitness study II: A summary of findings. Journal of Physical Education, Recreation, and Dance, 58(9), 51–56.
Rutherford, W. J., & Corbin, C. B. (1994). Validation of criterion-referenced standards for tests of arm and shoulder girdle strength and endurance. Research Quarterly for Exercise and Sport, 65, 110–119.
Zhu, W. (1998). Test equating: What, why, how? Research Quarterly for Exercise and Sport, 69, 11–23.
