
EU PRE-PILOT on LEARNING TO LEARN

Report on the compiled data


2008-1190/001-001 TRA-TRINDC

Sirkku Kupiainen, Jarkko Hautamäki, Pekka Rantanen
University of Helsinki Centre for Educational Assessment
2008

Contents

1 Introduction
2 Implementing the pre-pilot of the learning to learn instrument
   2.1 The test
   2.2 The sample
   2.3 Organization of testing
   2.4 Attrition
3 Results
   3.1 The cognitive domain
      3.1.1 Results according to the original scales
      3.1.2 The L2L cognitive construct(s)
      3.1.3 Item-level analysis of the cognitive component
      3.1.4 Conclusions regarding the cognitive component
   3.2 The affective domain
      3.2.1 Results according to the original scales
      3.2.2 The L2L affective constructs
      3.2.3 Item-level analysis of the affective component
      3.2.4 Conclusions regarding the affective component
   3.3 The metacognitive domain
      3.3.1 The metacognitive monitoring task
      3.3.2 Metacognitive accuracy and confidence
      3.3.3 An index for metacognitive accuracy and confidence
      3.3.4 Conclusions regarding metacognition
   3.4 The three-dimension framework of learning to learn
   3.5 Students' views on the test
4 Learnings from the learning to learn pre-pilot test: recommendations for the next stage

1 Introduction
Learning to learn has been at the centre of growing educational interest for more than a decade. Already before the nomination of learning to learn as one of the eight key competences by the Commission expert group on key competences in March 2002, several interested European countries had convened, supported by the EU Commission, in pursuit of a common instrument for measuring learning to learn. Since then, much progress has been made, culminating in the current endeavour for an indicator for the measurement of learning to learn, realised in collaboration between the European Commission (DG EAC and CRELL) and member states and leading to the test pre-piloted in eight European countries in spring 2008 (cf. Hoskins & Fredriksson, 2008). The pre-piloted test, built from components of four original tests more or less closely related to the concept of learning to learn, comprises three domains (the cognitive, the affective and the metacognitive), reflecting the three key dimensions of an individual's readiness, willingness and ability for lifelong learning. The results of the pre-pilot were originally presented, discussed and summed up at country level in a June 2008 meeting between representatives of the European Commission, the expert group set up by the Commission for the project, and representatives of the participating and other interested countries. The present report, based on the compiled data, analysed and summed up by the University of Helsinki Centre for Educational Assessment (CEA) under Tender EAC/06/2008, can be seen to represent the end of the current phase in the quest for an indicator for one of the key competences, learning to learn.

The path leading to the pre-pilot has been contingent not only on the search for an acceptable definition of the term learning to learn but also on the differing theoretical paradigms from which the concept has been approached in efforts to build an indicator for its measurement (Hoskins & Fredriksson, 2008, p. 16). The desire to overcome these differences by compiling a test based on a common European framework, but adopting tasks from tests arising from different theoretical backgrounds and conceptualisations of learning to learn, has led to an instrument marked by the paradigmatic differences between these earlier endeavours. These differences are also visible in the results of the pre-pilot. Many of the problems revealed in the analysis of the compiled data were already present in the national data and discussed in the respective national reports, but only the compiled data allows the use of statistical methods powerful enough to ascertain their existence and to point to ways to overcome them. Some of the differences between the tasks in the pre-piloted test are due to differences in the level of statistical rigour required of tasks and scales in the original tests. Accordingly, the recommendations based on such statistical manoeuvres may not always be felt to be warranted or to solve the underlying differences of approach. We hope, however, that the analyses and conclusions presented in this report will be of help in the ensuing discussion, as that is what we have been asked to offer: an inquiring eye on the data, using statistical methods appropriate for the empirical data in question. However, while statistical methods offer information on the technical adequacy of a scale or a construct, they do not offer advice on its ecological validity in measuring a multidimensional concept such as learning to learn, seen to present the first phase on the road to a more distant goal, lifelong learning. Hence, some caution is in order regarding the use of GPA (grade point average) in the models as a proxy for, or as the goal to be attained by, learning to learn competence. This, too, might accentuate the differences between the two main paradigmatic stances behind the test, with the cognitive-psychology paradigm perhaps more willingly accepting the use of the GPA as a reference, representing school as a (or the) central developmental forum for the competence use of the age group, while the socio-cultural paradigm might rather look for a broader point of reference in an effort to catch the kind of readiness to learn that might not manifest itself in formal schooling. However, the use of GPA is based on the simple fact that it was the only available reference outside of the test.

The report focuses on the empirical data received from the eight countries participating in the pre-pilot. Due to the differing number of students taking the test in the eight countries, all main analyses have been made with, and conclusions drawn on, weighted data to counteract the effect of these differences. This does not, however, nullify possible bias in either the combined or the country-level data, accentuating the importance of not interpreting any of the results as indicators of actual differences in level of attainment or opinion, especially when the analyses concern between-country differences. Throughout the report, it is to be kept in mind that any differences found might reflect differences in sampling, in the implementation of the test, or in the respective translations, rather than, or as well as, real differences between the groups discussed. Naturally, the samples were drawn to be as comparable as possible through the use of common guidelines, but the differences between the respective samples recorded in Chapter 2 point out the need to abstain from any country-level comparisons. After all, the endeavour is only about testing the test.

The report comprises three main chapters. First, in Chapter 2, a short presentation of the content of the test and the sample will be provided, including a discussion on attrition, to be kept constantly in mind when interpreting the results. Second, in Chapter 3, results for all three domains covered by the test (the cognitive, the affective and the metacognitive) will be reported and discussed. Each domain will first be analysed according to the original tasks and scales that the dimensions are built around. Next, the data of each domain will be analysed according to the conceptual structure of the European framework (Hoskins & Fredriksson, 2008, p. 29), followed by a SEM analysis estimating the fit between the empirical data and the conceptual model. After that, as allowed by the size of the compiled data, item-level analyses will be presented using Rasch modelling. Each subchapter will conclude with tentative recommendations for improving the test in that particular domain. Chapter 3 also includes an overview of the results of the questionnaires at the end of each booklet, surveying students' opinions on the test. Finally, in Chapter 4, a concise summary of the results will be presented, covering the three main domains and proposing steps to be taken for the actual pilot.

2 Implementing the pre-pilot of the learning to learn instrument


The pre-pilot was implemented successfully in all participating countries, according to the guidelines recorded in the cooperatively prepared preliminary documents. In no country did the implementation meet major obstacles, most probably reflecting the careful preliminary planning and work done in the respective countries before the actual implementation, necessitated by the late finalisation of the test. Testing was done within the agreed time frame and sample size in all countries, following the guidelines made for a feasibly representative sample which would allow predicting how the test would work in the country also with a fully representative sample. The details of the implementation are well documented in the respective country reports and in the synopsis made at the June 2008 meeting, and will not be discussed here. Instead, it is assumed that the representatives of the participating countries will deliberate on the results reported and the conclusions made in this report in light of their respective national experience of the test. There might not be reason to think that the results would be very different even with a somewhat different sample or form of implementation in each country. However, regarding the further development of the test, both the level of difficulty of the cognitive tasks and the quality of language and contextual setting of the affective items should be reconsidered carefully from the point of view of a more representative sample, maybe including schools where testing, be it high-stakes or non-curricular, may encounter resistance.

2.1 The test


Reflecting the framework, the test comprised three separate dimensions: the cognitive, the affective and the metacognitive. These have been further operationalised into subdimensions 1 (Hoskins & Fredriksson, 2008, pp. 28-29):

The cognitive dimension
- Identifying a proposition
- Using rules
- Testing rules and propositions
- Using mental tools

The affective dimension
- Learning motivation, learning strategies and orientation toward change
- Academic self-concept and self-esteem
- Learning environment

The metacognitive dimension
- A metacognitive monitoring task
- Metacognitive accuracy
- Metacognitive confidence

The cognitive dimension was measured by five tasks, referred to in this report by the names CCST, LagSev, Choco, Text and Lakes, comprising 16, 6, 21, 16 and 6 items, respectively, and presented in this order in the report due to their order in the two test booklets. The affective dimension was divided into two questionnaires of 49 and 40 items, respectively, one in each of the two booklets. The questions covering metacognitive accuracy and confidence were attached to three other cognitive tasks alongside the metacognitive monitoring task.

1 The order in which the three dimensions are presented throughout the report follows the guidelines agreed on for writing the national reports and does not represent a stance on the relative importance of each.

The test was divided into two booklets, presented to the students in one testing session of 2 x 45 minutes with a five-minute interval. Each booklet comprised first the questionnaire presenting the items of the affective domain, followed by three or four (Booklet 1 and Booklet 2, respectively) tasks measuring cognitive competence (including the metacognitive monitoring task), and ended with a one-page questionnaire on students' opinion of the booklet they had just completed. No examples of the tasks will be given in this report, to allow for their further use in the pilot phase.

2.2 The sample


The numbers of participating schools, classes and students in the different countries are shown in Table 1.
Table 1  Number of schools, classes and students participating in the study

            SCHOOLS   CLASSES 2   STUDENTS   BOYS   GIRLS
Austria         8        16          371      167     201
Cyprus          5         5          120       56      64
Finland         6        35          592      292     291
France         10        19          380      180     200
Italy           4        12          261      140     117
Portugal        5       ~10          216       79     134
Slovenia        5        14          240      120     120
Spain           6         6          145       72      73
Total          49       155         2325     1106    1200

2 In Portugal, the sample was not class-based, so Portugal has been left out of the analyses of between-class differences. The number in the table has been approximated based on the class sizes of the other countries, just to show the relative size of the sample in terms of classes.

The samples varied significantly in size, from 120 to 592 students. In four countries girls outnumbered boys, while in only one the situation was reversed. The difference was largest in Portugal, where only a third of the sample were boys. Due to the different sample sizes, all appropriate analyses have been made using country-based weights so that each country has the same computational number of students (N=592; a sketch of the computation is given after Table 2).3 However, as this is done at student level, the correction will not balance the gender imbalance. In all countries, a great deal of effort was made to ensure a varied, even if not representative, sample of the chosen age cohort of 14-year-olds. For this purpose, two regional variables were chosen: Region 1, referring to a stratification based on urban / rural areas, and Region 2, referring to a stratification based on capital / industrial / agricultural / coastal regions (Table 2).

3 One Cypriot student had answered only some of the questions in the two questionnaires, so there is no cognitive data for his/her weighted presence. Also, as the weight was set at country level, it does not correct for the gender imbalance.
Table 2  Regional stratification of the sample

Urban: Capital 1052, Industrial 464, Agricultural 253, Coastal 150 (total 1919)
Rural: 134 and 272 students in two of the four regions (total 406)
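To make the weighting described above concrete, the following is a minimal sketch of the country-weight computation (Python/pandas; the data frame and column names are hypothetical illustrations, not the scripts actually used):

```python
import pandas as pd

TARGET_N = 592  # the computational per-country N used in the report


def add_country_weights(df: pd.DataFrame) -> pd.DataFrame:
    """Give every student a weight so that each country's weighted
    student count equals TARGET_N (the largest national sample)."""
    out = df.copy()
    counts = out.groupby("country")["country"].transform("count")
    out["weight"] = TARGET_N / counts
    return out

# A country with 120 participants gets weight 592/120 = 4.93 per student,
# so all eight countries contribute equally to the pooled analyses
# (8 x 592 = 4736 weighted students, as in the weighted columns below).
```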

The agreed target population of the pre-pilot was the grade level where the majority of the country's 14-year-olds are enrolled. In some countries this meant the equivalent of grade nine, and in some, grade eight. No comparisons based on grade level, i.e., on the length of education students have received, have been made. Instead, to look at the impact of age on test results or on the student-test fit, students have been divided into four groups based on their age at the time of testing (Table 3). Despite the effort for comparable samples, the mean age difference between the youngest and the oldest age quartile is more than a calendar year, and that between the youngest student and the oldest student almost three years.

Table 3  Age of students by age group (quartiles)

           Mean    SD     Min    Max
Group 1    14.1   0.23   13.1   14.4
Group 2    14.5   0.08   14.4   14.7
Group 3    14.8   0.09   14.7   15.0
Group 4    15.4   0.43   15.0   17.2
All        14.7   0.53   13.1   17.2

Due to differences in school starting age and in the retention or class-repeating policies of the respective countries, the age of students varied considerably from country to country. The Austrian students were on average clearly younger, and the French students somewhat older, than those of the other countries (Table 4).4
Table 4  Mean age and percentage of students in the different age groups in the participating countries

            Mean   Group 1   Group 2   Group 3   Group 4
Austria     14.2     70 %      19 %       5 %       6 %
Cyprus      14.7     16 %      28 %      32 %      24 %
Finland     14.8     13 %      27 %      33 %      27 %
France      15.1     10 %      21 %      22 %      47 %
Italy       14.9     22 %      19 %      24 %      35 %
Portugal    14.6     31 %      26 %      25 %      18 %
Slovenia    14.7     19 %      33 %      31 %      17 %
Spain       14.8     18 %      28 %      28 %      26 %

To determine the role of language proficiency in students' ability to take the test successfully, they were asked about the language they had learnt first (Table 5).

4 In Portugal, the age variation in the sample does not reflect the real situation, as only the 14- and 15-year-old students at the designated grade level were included in the sample, probably causing the gender imbalance in the sample.


Table 5  First language of student

                                                ORIGINAL          WEIGHTED
                                                N       %         N       %
Language of instruction                       1878    81 %      3868    82 %
Language of instruction and some other
  language                                     254    11 %       581    12 %
Language other than that of instruction        126     5 %       209     4 %
Missing                                         67     3 %        78     2 %
Total                                         2325   100 %      4736   100 %

Four out of five students spoke the language of instruction as their only first language, while an additional ten percent had learnt it already at home alongside some other language. The share of students whose first language was other than the language of instruction was largest in Austria, France and Italy, and smallest in Finland and Portugal. The countries differed as to the languages most often spoken as a first language by students for whom the language of instruction was not their first language. Overall, the largest groups were the speakers of English, Arabic and Turkish (48, 32 and 20 students, respectively). Unlike when stating the first language they had learnt to speak, students had clear difficulties stating their parents' occupations, reflected in an attrition of nearly 25 % due either to non-response or to stating the occupation in such a way that it could not be easily classified. Students were a little more knowledgeable of their mothers' occupation than of their fathers', and of their parents' education (Table 6) than of their occupation. This difference, however, might just reflect the ease of ticking a multiple-choice option (educational level) compared to giving an open answer (occupation).


Table 6  Parents' education

                  ORIGINAL                           WEIGHTED
             MOTHER         FATHER             MOTHER          FATHER
              N      %       N      %           N      %        N      %
Primary      433   19 %     459   24 %         889   19 %      910   19 %
Secondary    870   37 %     786   34 %        1851   39 %     1787   38 %
Tertiary     694   30 %     689   30 %        1504   32 %     1450   31 %
Missing      328   14 %     391   17 %         491   10 %      589   12 %
Total       2325  100 %    2325  100 %        4736  100 %     4736  100 %

There were clear differences between the countries in the education of the parents of the sampled children, either reflecting real population-level differences between the countries or differences in the samples (Table 7). The students of Cyprus, Finland, Slovenia and Spain seem to come from relatively more educated homes, even if in Finland, as in Austria and France, the share of students who did not report on their parents' education was clearly higher than in the other countries. In Portugal, the share of students whose parents have attained only primary-level education is considerably higher than in any of the other countries. At the other end of the scale, Cyprus stands in a class of its own, with half of the mothers having attained tertiary education.
Table 7  Parents' education by country

MOTHER       Austria  Cyprus  Finland  France  Italy  Portugal  Slovenia  Spain
Primary        20 %     6 %     8 %     32 %    20 %    44 %       7 %     14 %
Secondary      38 %    41 %    33 %     32 %    40 %    31 %      54 %     44 %
Tertiary       11 %    51 %    38 %     20 %    37 %    23 %      38 %     37 %
Not known      31 %     3 %    21 %     17 %     3 %     2 %       1 %      6 %

FATHER       Austria  Cyprus  Finland  France  Italy  Portugal  Slovenia  Spain
Primary        22 %     3 %    10 %     33 %    18 %    51 %       8 %      8 %
Secondary      27 %    51 %    29 %     24 %    35 %    30 %      54 %     52 %
Tertiary       17 %    43 %    36 %     20 %    43 %    15 %      36 %     34 %
Not known      34 %     3 %    24 %     23 %     3 %     3 %       3 %      6 %


In most countries, a considerable share of students were not able to state their parents' occupation, at least not in a way that could have been reliably classified (Table 8). And, as with education, there are clear differences between the countries in the share of parents in the different occupational classes, complicating efforts to separate and compare across countries the effect of the different factors. For example, while the share of mothers falling into the two highest classes of legislators / managers and professionals is 22 % across all eight countries, in Cyprus their share is 43 %. This high parental occupational status is not, however, reflected in higher attainment among the students, at least not compared to the relative attainment of students with similar backgrounds in the other countries.
Table 8  Parents' occupation (weighted data)

                               MOTHER             FATHER
                              N      %           N      %
Armed forces                  -      -          85     2 %
Legislators, managers        253     5 %       465    10 %
Professionals                799    17 %       777    16 %
Technicians                  656    14 %       559    12 %
Clerks                       516    11 %       287     6 %
Service and sales workers    659    14 %       512    11 %
Agricultural and fishery      35     1 %        77     2 %
Craft and related trade      177     4 %       779    16 %
Machinery operators           50     1 %       292     6 %
Elementary occupations       559    12 %       328     7 %
Other                        499    11 %        89     2 %
Missing                      532    11 %       487    10 %
Total                       4736   100 %      4736   100 %

Reflecting the samples' bias in most countries toward urban regions, the participating students' home background is most probably above the average in most if not all countries. This emphasises the importance of not reading the results of the pre-pilot as representative of the countries as such. Also, the apparent incongruity in students' responses concerning their parents' education and occupation seems to imply that this information has to be interpreted as only an approximation of the real situation.

2.3 Organization of testing


The test was implemented in all countries within the agreed time frame of March to May 2008. The countries followed their respective local procedures concerning the implementation of tests in schools: in some countries the test was organised and administered in person by the unit responsible for the pre-pilot, while in others it was administered by teachers. Even when the latter procedure was used, a representative of the unit responsible for the pre-pilot was present at the school at the time of testing to advise and help the teachers.

In most countries and most schools, the time set aside for the test easily accommodated the agreed 45 minutes + pause + 45 minutes structure set in the instructions. In all countries, after the testing, research assistants of the respective research group / centre organised and ran the Focus Group interviews with students whom the teachers / schools had designated, according to the instructions given by the test organisers, for maximum variance of student type. The number of participants in the groups varied between four and six and comprised, in all schools, both girls and boys.

There were some differences in how the different booklets (the two variations of the first booklet, 1A and 1B, and the two variations of the second booklet, 2A and 2B) were distributed to the students in the different countries. In some countries (Cyprus, France and Portugal), the booklets were sampled maximally, producing eight different combinations; in some (Austria, Finland and Spain), four combinations were used; and in Italy and Slovenia all students began with Booklet 1A or 1B, followed by Booklet 2A or 2B, respectively. Overall, a little over half of the students took the test using either booklets 1A and 2A or booklets 1B and 2B, in that order, while a little over a third of the students


began with either booklet 2A or 2B (Table 9). Accordingly, it will be difficult to separate the possible effect of the order in which students did the booklets, or of the order of the tasks within them, from differences in the overall performance of the students of the different countries.
Table 9  Order of booklets

BOOKLET ORDER            N      %
1A followed by 2A       625     27
1A followed by 2B       123      5
1B followed by 2B       580     25
1B followed by 2A       122      5
2A followed by 1A       346     15
2A followed by 1B       100      4
2B followed by 1B       320     14
2B followed by 1A       104      4

Of the comments made by the teachers, the most common ones concerned students not being able to state their parents' education or occupation, accidental mistakes or difficult words in the tasks, and the layout of some of the tasks, which seems to have made it difficult for students to understand the nature of the task (this especially in the Chocolate task).

2.4 Attrition
Estimating external attrition in the pre-pilot is complicated by the fact that many countries seem to have reported just the students present in the classes at the moment of testing, i.e., not the number of students normally enrolled in the sampled classes. The actual attrition rate might thus be somewhat higher than reported here. However, as the data was not collected to compare the actual learning to learn competence of students in the eight countries but to assess the working of the test as an indicator for it, possible differences in attrition are not of great importance. Still, as attrition tends to be higher and more systematic among weaker students, the possibility of the results of the pre-pilot being skewed in this respect should be kept in mind when assessing the functioning of the instrument.

From the point of view of the working of the test, internal attrition (items left unanswered by students who were present in the test situation) is of greater importance. In this respect, there are clear differences both between and within the tasks, and between the participating countries. Among the cognitive tasks, item-level attrition was highest in the task Lakes (23 %) and lowest in the cognitive part of the Metacognitive task (4 %). Attrition was 14 % in the Choco task, while in the CCST, LagSev and Text tasks it was between 5 % and 8 %. It could be concluded that the more school-like or non-cognitive-looking the tasks are, the readier students are to tackle them. However, in all tasks there were items that attracted fewer responses than the others, most often reflecting the real or face-value difficulty of the item.

Students in different countries also seem to have answered the items with varying diligence. In all tasks, item-level attrition is highest in Austria. However, even if there are also clear differences in attrition related to students' age in three of the tasks (CCST 2, Lakes and Choco; F=9, 11 and 15, respectively, all p<.001), these differences seem not to explain all of the higher Austrian attrition. Excluding Austria, the differences between age groups disappear except for a slight difference in CCST 2 (F=3, p<.05), while differences between countries weaken but remain in all tests (F=3-6, p<.05-.001). There are also differences in attrition in some tasks related to students' school achievement (GPA) (F=3-7, p<.05-.001; no difference in CCST 2 or the Metacognitive task), indicating that at least in some tasks non-answering is related to the apparent or real difficulty of the items.

One clear reason for item- and task-level attrition is the order in which students did the booklets and in which the tasks were set in them (Booklets 1 and 2, versions A and B). The impact was most prominent in the task CCST 2 (see above), but only if the students did version B of Booklet 2, in which CCST 2 was the last task, as their first booklet (attrition 21 %, vs. 5 % for those who did booklet 2B as their second booklet and only 3 % for those who did booklet 2A as their second booklet). A similar, even if smaller, impact of booklet order was seen in the Lakes task, which was the last task in booklet 2A (attrition 34 % for students doing booklet 2A as their first booklet vs. 20 % when it was their second). But as attrition in this task was high overall, the role of the position of the task was smaller than for CCST 2 (attrition for booklet 2B, where Lakes was the next-to-last task, was 23 % vs. 19 %, depending on whether the booklet was the students' first or second). For attrition in the other tasks, the order in which students did the booklets, or in which the tasks were set in them, had less bearing. Apparently, in the somewhat longer Booklet 2, many students were pressed for time, independent of booklet order.

The apparent impact of the order in which the booklets were done is entwined with differences in general attainment between the countries, as the share of students from different countries varies in the groups, mainly due to all students in Italy and Slovenia beginning with Booklet 1 (type A or B). Among the other six countries, the share of students beginning with Booklet 2B was around 25 % in Austria, Finland, France and Spain, and around 20 % in Cyprus and Portugal. Task- and item-level attrition is reflected in students' responses to the question at the end of each booklet concerning the sufficiency of the allocated time (Table 10).
Table 10  Share of students stating the time was insufficient to answer all the questions (item 8b), according to booklet order

BOOKLET ORDER           Booklet 1   Booklet 2
1A followed by 2A         17 %        13 %
1A followed by 2B         13 %         5 %
1B followed by 2B         15 %        10 %
1B followed by 2A         12 %        11 %
2A followed by 1A         10 %        38 %
2A followed by 1B         12 %        37 %
2B followed by 1B          7 %        40 %
2B followed by 1A         15 %        48 %


As a separate time was allocated for answering this end-of-booklet questionnaire, and almost all students answered its questions, the share of students commenting on the insufficiency of time can be expected to reflect the real situation fairly well. However, the share of students stating that the time was not sufficient is bigger than the share of non-answered items, possibly indicating that some students were not able to give the last tasks their full attention, leading to a smaller percentage of correct answers in those tasks.

3 Results
In this chapter, we will present the results of the pre-pilot in the eight participating countries, covering the three domains of the instrument (the cognitive, the affective and the metacognitive component; subchapters 3.1, 3.2 and 3.3, respectively), together with students' views on the test. Full statistical data on the basic analyses are presented in the Appendices.

3.1 The cognitive domain


The results of the cognitive component of the pre-pilot test will be presented in this chapter both according to the original data with its 2325 students and by applying analysis-appropriate means to counterbalance the effect of the differing sample sizes on the results.5 In some analyses this is done by means of simple weighting, while in others the total number of students has been multiplied by ten, after which a random sample of 600 students has been drawn to represent each country.

5 All task-level descriptives and analyses are based on data where missing values (i.e., values for a missing task, not for missing items) have been replaced using as reference the mean attainment in that task of students of the same country, gender and cognitive Z mean for all the tasks they have completed.
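A minimal sketch of the replacement rule described in the note above (Python/pandas; the data frame layout, with one row per student, one score column per task, and columns for country, gender and a banded cognitive z-mean, is a hypothetical illustration):

```python
import pandas as pd


def impute_missing_task(df: pd.DataFrame, task: str) -> pd.Series:
    """Replace a wholly missing task score with the mean score of
    students from the same country and gender with a similar overall
    cognitive level (z-mean band over the tasks they completed)."""
    reference = df.groupby(["country", "gender", "z_band"])[task].transform("mean")
    return df[task].fillna(reference)
```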


Reporting of the results is divided into four subchapters. First, basic task-level data will be reported, looking at the mean level of attainment (percentage of correctly solved items), group differences based on country, gender, home background and attainment level, and the reliability of the original scales. Second, structural equation modelling (AMOS) will be used to analyse the fit of the data at construct level with the conceptual model of the cognitive component of the learning to learn framework. Third, the functioning and difficulty of individual items will be analysed using Rasch modelling (IRT), including analyses of relative item difficulty and country-level deviations from it. Related to this, the different tasks will be discussed individually, and propositions will be made for revising the scales and the test to improve the fit between the conceptual framework and the actual competences of the studied age cohort. Finally, the results of the different analyses will be drawn together to conclude the cognitive subchapter.

The order in which the booklets were presented to the students, as well as the version students had of the two booklets (even if differing only in the position of the CCST task as either the first or the last of the cognitive tasks), seems to have had some effect on students' performance in the tasks. However, as there were between-country differences in the share of students doing the two versions of the two booklets in different orders, this will not be analysed separately, due to the difficulty of disentangling the impact of the order in which students did the booklets from the influence of other factors.6

3.1.1 Results according to the original scales

To maximally profit from all available information in the evaluation of the functioning of the cognitive domain of the test, the five-item cognitive task that formed part of the metacognitive component (metacognitive monitoring) has been analysed together with the other cognitive tasks. The tasks will be referred to by the names CCST, LagSev, Choco, Text, Meta and Lakes, and listed in the tables according to their appearance in the two booklets.

6 In all tasks, between-country differences were clearly bigger than differences between the groups based on the type and order of booklet (ANOVA F[7/4723]=14-58 vs. F[7/4715]=10-26, all p<.001).


As could be seen already in the country reports, there were clear differences between the six cognitive tasks in the test as to their difficulty for the students and, partially reflecting but also independent of this, as to the reliability or inner coherence of the tasks (Table 11). However, it is to be kept in mind that it is not clear how unidimensional cognitive scales should ideally be. This might be especially true in the field of learning to learn, where the measured competence might comprise different sub-dimensions that do not necessarily get fostered in the same order in all curricula.
Table 11  Percentage of correct answers in the cognitive tasks 7

                  ORIGINAL DATA                        WEIGHTED DATA
              N      Mean    SD     Alpha          N      Mean    SD     Alpha
CCST        2324    59 %    20.7     .68         4731    58 %    20.4     .67
LAGSEV      2324    47 %    30.8     .76         4731    47 %    30.7     .75
CHOCO 8     2324    49 %    23.4     .81         4731    48 %    22.1     .79
CHOCO task  2324    22 %    28.4     .74         4731    19 %    26.5     .79
TEXT        2324    43 %    18.1     .60         4731    42 %    17.5     .57
META        2324    63 %    26.9     .47         4731    62 %    27.0     .47
LAKES       2324    49 %    20.7     .15         4731    48 %    20.8     .16

Among the cognitive tasks, the Meta task and the CCST task were clearly the easiest for students, while the Piagetian task Choco, measuring students' ability to understand the role or effect of changing variables in a comparison, was the most difficult, though only at task level. The last remark is warranted as, unlike the other tasks, which comprised just individual items, the Choco task comprised six subtasks, each in turn comprising three or four individual items. Reflecting this extra difficulty, even if the mean percentage of correct answers for the items was 48 %, a full 40 % of students could solve none of the six subtasks correctly.9

Both the original and the weighted data are given to indicate the impact for the reliability of the scales and the mean level of students attainment of the differing number of students participating in the pre-pilot in the eight countries. As expected, the effect of weighting is especially salient in comparisons between countries. 8 To match the other tasks, the Chocolate task has here been regarded at item level. The reliability for task level is the same as for item level despite the difference in number of items in the reliability analyses (6 vs. 21) but the mean level of performance is clearly lower (percentage of correct answers 22%/19 % vs. 49%/48 % for the unweighted/weighted data).
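For reference, the reliabilities in the alpha columns of Table 11 are Cronbach's alphas; the following is a minimal sketch of the coefficient's computation on a complete (n students x k items) score matrix:

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / var(total score))."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)
```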


The percentage of students answering correctly the items of the different tasks is shown in Figure 1.

Figure 1  Percentage of students answering correctly the items of the different tasks (seven panels: CCST, LagSev, Choco by item, Choco by problem, Text, Meta, Lakes; vertical axis: proportion of correct answers, 0.00-1.00)

9 On the other hand, due to this two-layered structure of items and problems, the Choco task is fairly impervious to guessing, indicating that students who do get a problem (whole subtask) correct most probably really master the level of thinking it measures.


Not only were the tasks of differing levels of difficulty, but there were also differences between the participating countries in students' attainment in them (Figure 2). However, the differences were not, or at least not directly, related to the difficulty of the task: they were both largest and smallest in the two easiest tasks, the CCST and the Meta task (ANOVA, F=58 vs. F=14, respectively, both p<.001). In the first, the differences were more evenly distributed among the countries, whereas in the latter the Finnish students differed from those of the other countries with their better performance, while differences among the rest were relatively small. Compared to this difference, the Finnish students' performance in the LagSev task, at the mean level of all the countries, demonstrates well the difference between the two mathematics-related tasks in the test: the more school-like arithmetic Meta task, and the LagSev task, which has been distanced from the simple computations of maths classes to probe students' understanding of the rules behind the everyday arithmetic operators.

Figure 2  Percentage of correct answers in the cognitive tasks (CCST, LagSev, Choco, Text, Meta, Lakes) by country

The attainment of boys and girls differed in only some of the tasks, and the differences in the reliability of the tasks by gender were relatively small (Table 12). Girls outperformed boys in the CCST task, while boys outperformed girls in the Meta task. However, there were differences between the countries regarding the tasks with (and the magnitude of) gender differences, so no clear conclusion concerning a possible gender bias in the test can easily be made. Likewise, there were differences between the performance of students from urban and rural schools in some of the tasks over the whole sample (the former performing better in the Text task and the latter in the Meta and Choco tasks; F=20 vs. F=15 and F=7, respectively) but, as with gender, there were differences between the countries as to the tasks and the magnitude of these differences.
Table 12  Percentage of correct answers in the cognitive tasks by gender

             BOYS (N = 2239)          GIRLS (N = 2461)        Diff.
             Mean    SD    Alpha      Mean    SD    Alpha       F
CCST         56 %   20.6    .67       59 %   20.1    .67       16
LAGSEV       46 %   31.3    .74       49 %   30.2    .77        7
CHOCO 10     49 %   22.4    .77       48 %   21.9    .79        7
TEXT         42 %   17.4    .56       43 %   17.6    .57
META         63 %   27.4    .43       60 %   26.4    .51       16
LAKES        48 %   21.0    .14       48 %   20.7    .18
Differences related to students' home background were obvious, even if these, too, varied by country. Over the whole sample, the impact of parents' education was greatest in the CCST and Meta tasks (F=78/81 and F=42/59 for mother/father, respectively), even if this was not the case in all countries (most notably, in Spain mothers' education had no impact on students' attainment in the CCST task). Differences based on the first language spoken (a dichotomy of the language of instruction vs. other) were, in the CCST and Meta tasks, of the same magnitude as differences related to parental education (F=86 and F=58, respectively), but there were smaller differences also in the other tasks (F=27 to F=11). Students' performance in the different tasks clearly reflects their school achievement (GPA), even if some of them, especially the Text task, were clearly too difficult even for many students in the highest GPA quartile (Table 13).

10 To better match the other tasks in terms of the impact of guessing, the Chocolate task has been regarded here at item level. However, the reliability at task level is the same as at item level despite the difference in the number of items (6 vs. 21), but the mean level of performance is clearly lower at task level (percentage of correct answers 19 % vs. 48 %).


Table 13  Percentage of correct answers in the cognitive tasks by GPA quartiles

           Lowest 25 %    Mid-low 25 %   Mid-high 25 %   Highest 25 %
           Mean    SD     Mean    SD     Mean    SD      Mean    SD
CCST       46 %   19.1    54 %   18.0    61 %   19.4     69 %   17.5
LAGSEV     33 %   28.1    40 %   28.8    53 %   29.7     62 %   28.7
CHOCO      42 %   18.4    44 %   18.7    48 %   21.4     59 %   25.2
TEXT       39 %   15.5    39 %   16.9    43 %   17.4     49 %   18.6
META       50 %   25.7    55 %   26.4    67 %   25.2     75 %   23.3
LAKES      43 %   18.6    44 %   21.4    50 %   21.1     54 %   19.8

Even if better-achieving students on average perform better in the cognitive learning to learn tasks, the test as a whole and the different tasks separately (maybe excluding the Meta task) clearly measure something beyond mere school attainment. The fairly low correlations between students' results in the different tasks seem to imply that they succeed in measuring relatively independent cognitive dimensions, all related to each other, to students' school performance and to their home background, but none succeeding in explaining even 20 % of the variation in each other or in students' school achievement (Table 14).
Table 14  Correlations between students' performance in the different cognitive tasks, the test sums, and school achievement (country-specific quartiles) *

           CCST   LAGSEV  CHOCO i  CHOCO t  TEXT   META   LAKES  TEST I  TEST II
LAGSEV     0.38
CHOCO i    0.35    0.28
CHOCO t    0.42    0.35    0.82
TEXT       0.30    0.22    0.21     0.25
META       0.39    0.35    0.32     0.37    0.18
LAKES      0.28    0.23    0.20     0.25    0.15   0.24
TEST I     0.72    0.66    0.57     0.69    0.55   0.66   0.56
TEST II    0.73    0.67    0.57     0.70    0.59   0.47   0.58    0.97
GPA        0.40    0.35    0.27     0.34    0.19   0.35   0.19    0.47    0.45

* CHOCO i = item level; CHOCO t = task level; TEST I = with Meta task; TEST II = without Meta task

Despite the relatively low mean percentage of correct answers, several of the tasks were too easy for some students. Of the 4731 students (weighted data), 753 solved all the items in the Meta task correctly, 625 all the items in the LagSev task, 40 all the items in the CCST task, 34 all the items in the Choco task, and 9 all the items in the Lakes task. In the Text task, the best result was 94 % correct, meaning one incorrect item among the 16, obtained by 23 students, many of them getting 100 % correct in one or several of the other tasks. No student towered above all the others by getting the full hand, but a fourth of the students (1222, weighted data) solved all items (or 15/16 in the Text task) correctly in at least one task, 218 in two tasks, 21 in three, and one student in four tasks. The number of students performing at the top level in at least one task varied between 131 and 202 by country, meaning that in each country every fifth student could solve correctly all items in at least one task. Despite girls outnumbering boys in the sample (2466 vs. 2239), there were slightly more boys than girls among these top performers (618 vs. 595).

Reliability of the scales

As was seen in Table 11, many of the original tasks proved lacking in internal coherence, with the Lakes, Meta and Text tasks showing especially low reliabilities (Cronbach's alpha < .60). Also, the .67 reliability of the CCST scale is largely due to the high number of items in it, with only four of the 16 items showing an item-total correlation of .40 or higher. Accordingly, only the Choco and LagSev tasks can be seen to present, in their present form, adequately reliable scales.11

The reliability of the cognitive scales also differed by country. The LagSev task formed a very reliable scale in all countries (Cronbach's alpha .71-.76) despite differences in the mean level of students' performance in it (from 53 % to 39 %, in France and Slovenia, respectively). Instead, the reliability varied between .52 and .71 in the CCST scale (Portugal vs. Finland and Slovenia, respectively), between .60 and .88 in the Choco task at item level and between .58 and .82 at subtask level (Cyprus vs. Finland, respectively), between .36 and .69 in the Text task (Portugal and Finland, respectively), and between .17 and .61 in the Meta task (Spain and Slovenia, respectively). There were differences between the countries in the reliability of the Lakes task, too, but in no country did its reliability rise above .26.

11 In the Choco task, at the sub-task level, this high reliability reflects the low probability of getting a whole sub-task correct by guessing, while five of the six subtasks actually measure the same scheme of reasoning. The sixth sub-task (3C), the most difficult in the two booklets, measures a more advanced scheme and is solved only by a few, leading to a very low item-total correlation (.10).

Put together, the five cognitive scales form a relatively reliable test (Cronbach's alpha .67).12 However, for two of the tasks (Text and Lakes) the item-total correlation is below .40, and even for the other three (four with the Meta task) it stays below .55. Viewed at item level, the reliability of the instrument rises to .86, but this can be seen as an outcome of the large number of items involved, with only nine of the seventy items having a scale-total correlation above .40. This lack of unidimensionality is also apparent in efforts to apply factor analysis over the totality of the items. A simple principal component analysis (eigenvalue > 1, no rotation) produces 22 factors that together explain 59 % of the variance, with difficult-to-interpret factors based on item type and item difficulty rather than on substantive content. An exploratory factor analysis (maximum likelihood, Varimax rotation, eigenvalue > 1) finds no local minimum within 100 iterations and, when six factors are enforced to match the six original scales, only the six items of LagSev load on a clear factor of their own (only loadings over .30 have been registered), and eight of the 16 items of the Text task on another. Of the other four factors, three comprise a selection of the Choco items, while the fourth combines the rest of the Choco items with some contingent items from the CCST, Text, Meta and Lakes tasks (even from the LagSev, with a negative secondary loading). Eighteen of the 70 items do not load on any of the six factors above the set limit of .30.

12 Without the Meta task, which is originally not a part of the cognitive component of the test, the reliability is .63.

There is no reason to think that the cognitive dimension of learning to learn would or should be unidimensional; after all, the goal in building the instrument was expressly to cover the (alleged) diverse cognitive sub-dimensions contributing to the formation of learning to learn. Accordingly, the reliability of the test as a whole should not be seen as a hindrance or a drawback, even if the individual scales might require further work to gain better reliability. The coherence of the cognitive component will be viewed in the next subchapter using structural equation modelling (AMOS).

3.1.2 The L2L cognitive construct(s)

The cognitive learning to learn constructs, identifying a proposition (A), using rules (B), testing rules and propositions (C), and using mental tools (D), were somewhat less reliable than the tasks that formed their core (Table 15).13 In each of the three scales A, B and C where parts of the CCST were included, the item-total correlation of these additional items was below .40. This is due partially to the low internal reliability of the CCST items themselves, and partially to their low correlation with the other items in the respective constructs.
Table 15  Basic descriptives for the learning to learn cognitive constructs 14

                                          Mean    SD    Alpha
A  Identifying a proposition              47 %   16.6    .63
B  Using rules                            51 %   24.3    .67
C1 Testing rules and propositions         47 %   22.4    .78
C2 Testing rules and propositions         33 %   22.4    .71
D  Using mental tools                     48 %   21.8    .15

To better study the internal coherence of the cognitive part of the test (or the relative weight of the different dimensions in the overall cognitive component), structural equation modelling (AMOS) was used. Due to the lack of internal consistency of the sub-domains A, B and C, evident in the reliability analyses and confirmed with a test model, the CCST items in each of the three constructs A, B and C were introduced into the model as independent constructs, as were the two types of subtasks in Choco, the pre-assigned and the self-built comparisons (Figure 3).

13 The descriptive statistics are given as means of percentages of correct answers in the different scales, even if this is not quite warranted due to differences in the difficulty of the original scales inside them.

14 In C1, the Choco task has been calculated, in reliability and mean, at item level; in C2, by the six subtasks.


Figure 3  Model 1: a single learning to learn factor with eight indicators (text.ident, ccst.ident, LagSev, ccst.using, choco.AC, choco.DF, ccst.testing, Lake; standardised loadings between .34 and .69)

Despite the allowances made, the fit of the data to the model was still poor. Accordingly, a new model (Model 2) was built, allowing for shared error variance between the components (Figure 4). Unlike the previous attempt, Model 2 showed good fit with the data (chi-square = 17.496, degrees of freedom = 11, probability level = .094).

Figure 4  Model 2: the structure of Model 1 with correlated error variances allowed between components (standardised loadings between .32 and .72)


Even if there was no shared error variance between the three CCST units, their low reliabilities (.50 for construct A, and .29 and .33, respectively, for constructs B and C) and their lack of shared error variance with the core tasks in their respective constructs seemed to give little reason to keep the CCST test divided into the three units. Accordingly, a third model was built with the CCST test allowed to form a dimension of its own, seen to be well justified by the clearly differing, applied nature of the items, coming close, for example, to the concept of practical intelligence (cf. Sternberg, 1999). Even if the model reduces each component of the cognitive domain to being measured by just one task, it seems to present a justifiable whole to cover the different dimensions of the original conceptualisation of learning to learn in the European framework, adding to it a fifth, more applied dimension, represented by the CCST task (Figure 5).

Figure 5  Model 3: a single learning to learn factor measured by five task-level indicators (Text, Choco, CCST, Lake, LagSev; standardised loadings between .39 and .63)

The fit of Model 3 is good (chi-square = 8.055, degrees of freedom = 4, probability level = .090), justifying the acceptance of the model as a basis for reconsidering the sufficiency of the original scales to measure the five dimensions of the cognitive component of learning to learn.
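The probability levels quoted for Models 2 and 3 follow directly from the chi-square distribution; a quick check with scipy:

```python
from scipy.stats import chi2

# A non-significant chi-square means the model is not rejected,
# i.e., the conceptual structure is compatible with the data.
print(chi2.sf(17.496, df=11))  # Model 2: ~0.094
print(chi2.sf(8.055, df=4))    # Model 3: ~0.090
```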

3.1.3 Item-level analysis of the cognitive component

The size of the compiled data allowed the use of item response theory (IRT) to look in more detail at the functioning of individual items in the test. As the weighted data can be assumed to better represent the full eighth-grade student body of the European Union, the analyses presented in this chapter are based on it. However, for each country to be able to assess the implications of a possible bias in its own sample compared to a more representative national sample, the distribution of students and items according to the original data is presented in Appendix 2, together with the more detailed item-level data for the weighted data.

On average, a cognitive item was answered correctly by 53 % of students, and in all but the Text task there was at least one item that two thirds of students answered correctly. However, there were clear differences between the tasks in how their items were distributed along the whole difficulty range of the test. While items solved correctly by at least 53 % of students (the average for the whole test) formed 80 % of the items in the Meta task and close to 70 % in the CCST and Choco tasks, their share was only 33 % in the LagSev and Lakes tasks and a mere 6 % in the Text task (see Figure 1). Also, even if 67 % of the items in the Choco task were solved correctly by the majority of students, none of the subtasks was solved by more than 30 % of them, and the one hard subtask was solved correctly by just 2 %. Already this rudimentary comparison reveals that not only do the different tasks measure fairly independent components of the cognitive domain, as shown in the previous subchapters, but they also measure their respective areas at different levels. This, of course, is a problem from the point of view of a balanced indicator.

The Rasch analyses will first be presented as a short overview of the test- and task-level results. After that, each task will be discussed in more detail, looking at the relative ease or difficulty of the individual items for the compiled student sample and for the students of the different participating countries.15 Some suggestions will also be made for actions to be taken to reach a better fit between the items and the targeted student population, taking into account also the earlier presented wish to further shorten the test. The relative difficulty of all the items in the test (Rasch scale) is shown in Figure 6.

15 Country-level differences in student-item fit might indicate inaccuracies in translation, but they might also be due to differences in curriculum or other cultural factors pertinent to some of the items.


Figure 6  Item and student distribution on the common Rasch scale, weighted data (IRT, ConQuest). Items are colour-coded by task (CCST, LagSev, Choco, Text, Meta, Lake); one X stands for 7.3 students. For the list of the individual item numbers in the different tasks, see Appendix 2.


The high norm level (0) of the Rasch scale shows the test to have been somewhat too difficult for the sampled students. As it is, only 47 % of students could solve correctly the items falling at the norm level, while the performance of 60 % falls below it. Also, not only are there differences in the way the different tasks cover, or do not cover, the full difficulty range of the test, but the items do not cover the range evenly even as a whole, leaving gaps in the student-item fit. This is especially true for the more able students, but also for the average and weaker students.

Only the LagSev task offers items across a fairly full range of difficulty, reflected in the evenly increasing percentage of correct answers (see Figure 1). Two of its items were easy for most students (one even too easy), two represent the (somewhat too high) norm level, while the last two offer a challenge even to the better-performing students. Of the other tasks, the CCST clearly lacks items at the difficult end while providing unnecessarily many of the same level(s) of difficulty, in three clusters along the range from (almost too) easy to slightly difficult. The Choco task, instead, while also providing unnecessarily many items at the mid-low to mid-difficult level, comprises two items that, if both are solved correctly, suffice in themselves to single out the few very top students.16 The Text task clearly presents unnecessarily many items at the mid-difficult to difficult range, while almost half of the students are left outside its reach, the items being too difficult or the text too long to read with the attention required for answering. The items of the Meta task (which is not conceptually part of the cognitive component) fall into two groups, three being too easy to differentiate except among the very weakest students, and two at the mid to mid-difficult level. As in Choco, the items in the Lakes task fall sharply into two groups. Two of the items are too easy, due to a mistake in the information given in the stimulus of the task, which led to two of the three choices having to be accepted as correct. The remaining items, instead, cover the upper range of difficulty fairly evenly.

Even the weakest of the 34 students who solved all the Choco items correctly performed one standard deviation above the mean in the total test score, while the mean for this high-performing group was +2.15, the highest among the groups attaining full points in a single task.
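The relation between the student and item distributions in Figure 6 follows directly from the Rasch model: students and items are placed on the same logit scale, and the probability of a correct answer depends only on the difference between student ability and item difficulty. A minimal sketch of that relation (illustration only, not the ConQuest estimation itself):

    import numpy as np

    def rasch_p(theta, b):
        """Probability that a student of ability theta (logits) solves
        an item of difficulty b under the dichotomous Rasch model."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    # A student at the norm level (theta = 0) has even odds on a
    # norm-level item but clearly lower odds on a difficult one:
    print(rasch_p(0.0, 0.0))   # 0.50
    print(rasch_p(0.0, 1.5))   # ~0.18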


IRT models were also estimated for each weighted country data set separately. Comparing the outcomes of these with the model for the compiled data makes it possible to identify items with country-level bias. For each item, the actual attainment of students in a country is compared to their expected performance on it, based on the mean test performance of the country's students in relation to the whole student sample and on the mean for the item in question across countries (DIF). The summary for items with possible country bias or otherwise poor fit is presented in Table 16.
Table 16 Number of items with poor country fit (DIF < -.40 easier, or DIF > .40 more difficult, than expected). For items with MNSQ fit < -2 or > 2 the model does not fit the data.

No. of countries | Items easier than expected | Items harder than expected | Items with DIF | MNSQ fit < -2 or > 2 | MNSQ fit < -2 | MNSQ fit > 2
No country | 33 | 49 | 16 | 35 | 48 | 55
1 country | 7 | 13 | 20 | 16 | 12 | 4
2 countries | 9 | 8 | 17 | 6 | 2 | 4
3 countries | 14 | | 14 | 4 | 1 | 3
4 countries | 2 | | 2 | 3 | 2 | 1
5 countries | 1 | | 1 | 4 | 3 | 1
6 countries | | | | 2 | 1 | 1
7 countries | | | | | |
Total | 70 | 70 | 70 | 70 | 70 | 70
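The mechanics of this comparison can be illustrated with a short sketch (the difficulty values below are hypothetical; the actual estimates come from the per-country and pooled ConQuest models):

    import numpy as np

    def dif_index(b_country, b_pooled):
        """DIF as the difference between an item's difficulty estimated
        within one country and in the pooled model (logits): negative
        values mean easier than expected, positive harder."""
        return np.asarray(b_country) - np.asarray(b_pooled)

    # Hypothetical difficulties for five items in one country:
    b_pooled = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
    b_country = np.array([-1.3, -0.9, 0.1, 1.3, 1.6])
    dif = dif_index(b_country, b_pooled)
    print(dif)                                 # [-0.1 -0.5  0.   0.5  0.1]
    print(np.flatnonzero(np.abs(dif) > 0.40))  # items 1 and 3 flagged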

The largest single group among the items with poor fit in at least one country comprised 16 of the 21 Choco items, including all items in the three pre-assigned comparisons. One of the two items showing poor fit in seven of the eight countries was the first item of the Choco task. The first items of the LagSev, Meta and Lakes tasks also showed poorer fit than any or most of the other items in those tasks, perhaps indicating that students had difficulties in settling into new types of tasks, which might in itself be an indicator worth analysing as part of the metacognitive component.
For a full list of the DIF indices, see Appendix 2. As the samples are not representative, the bias might not express actual country bias but may be due to differences in the respective sample make-ups. However, to ensure the fair functioning of the test, probable reasons for the country-level DIF should be sought, including translation and language.

When items worked contrary to expectation in a country, it was more often because students performed better than their overall attainment would have predicted for that item. In three of the tasks, CCST, Text and Lakes, students more often performed better than expected (about 56 % of cases), while in LagSev, Choco and especially in Meta, students' performance was more often below their expected level (56 %, 62 % and 71 % of the cases, respectively). Still, perhaps reflecting the relative easiness of the CCST items, the biggest deviations toward weaker performance came up in five of the CCST items. In Table 17, the number of items with DIF under -.40 or over .40 is given for each country and each task, to offer a more detailed picture of the relative functioning of the tasks. Reflecting these against the IRT model presented in Figure 6, preliminary recommendations will be given for the further development of the test.
Table 17 Number of items per task with poor country fit (DIF < -.40 easier, or DIF > .40 more difficult, than expected) in the participating countries

CCST: easier 19 in total (per-country counts 1, 1, 4, 3, 5, 3, 2 across seven countries); more difficult 15 (2, 5, 1, 2, 2, 1, 2 across seven countries)
LAGSEV: easier 3 (1, 1, 1 across three countries); more difficult 5 (2, 1, 2 across three countries)
CHOCO: easier 10 (1, 2, 3, 1, 3 across five countries); more difficult 16 (3, 2, 4, 2, 5 across five countries)
TEXT: easier 11 (3, 1, 2, 5 across four countries); more difficult 9 (1, 1, 2, 1, 1, 3 across six countries)
META: easier 2 (1, 1 across two countries); more difficult 5 (2, 2, 1 across three countries)
LAKES: easier 8 (1, 2, 2, 3 across four countries); more difficult 6 (3, 1, 1, 1 across four countries)
All tasks, totals per country: Austria 8, Cyprus 18, Finland 14, France 15, Italy 11, Portugal 14, Slovenia 13, Spain 17 (grand total 110)

CCST

The CCST task differs from the other cognitive tasks in some crucial ways. First, unlike the others, it presents many problems to which there are no objectively correct answers in the same sense as there is a correct answer to the items in the LagSev or Meta tasks, for example. And while in the Text task, too, students are invited to evaluate the relative weight or importance of the presented statements in view of the text they have read, the criteria for a correct answer differ between the Text and CCST tasks: in the Text task, the criteria are based on the reading of the text by educated young adult readers (university students), whereas the reference group for the CCST task comprises students of the same age as the pre-pilot sample. Hence, it is not always clear whether an incorrect answer represents less or more advanced reasoning vis-à-vis the presented dilemma. In some of the items, an incorrect answer might also be due to cultural differences and hence should not necessarily be considered incorrect. Also, the examples used to introduce each new pair of questions in the task, which might have been welcome in the original longer version of the test, were now found superfluous and even distracting by students in all countries. The same students might not have come to think, however, that the stories with the pictures might have contributed to the CCST being found the easiest and the most interesting among the cognitive tasks.

The results of the Rasch analysis, presented in Figure 6, and the subsequent DIF indices offer a clear guideline for shortening the CCST task by reducing the number of items at the mid-low and mid-high ranges of difficulty. However, the analyses are of no help in providing new, more difficult items to extend the coverage of the task to the full extent of the difficulty range of the test. For this, it would be necessary to go back to the original test for further items, referenced with older students. In most countries, students' performance in the items with DIF > .40 or < -.40 was in some items better and in some weaker than their overall level of performance would have predicted. The CCST items generated clearly less stable results among the Italian and Cypriot students, while in Spain and Austria only one or two CCST items elicited unexpected results. (Overall, the Austrian students seem to have been the most stable or predictable in their performance, with only seven items reaching the DIF threshold of .40, while the number of such items in Cyprus, with a fairly similar level of attainment, was eighteen.) In no item, however, did students of more than three countries perform against expectation. The five CCST items at the upper-mid difficulty range worked in a more balanced way across the countries, with only a few cases of high DIF indices. This might, however, just reflect the relative difficulty of these five items in a test that was fairly difficult overall. There were also differences in the fit of the items related to gender, with all five items with T < -.20 or > .20 favouring boys. However, even if all the items with poor fit either for gender or in at least two countries are left out, the remaining CCST items still do not form a reliable scale (Cronbach's alpha .43, no item-total correlation above .25). Accordingly, even if six of the present CCST items have been proposed to be kept in the hypothetical Rasch model in Figure 7, a new set of items forming an internally more consistent whole would be welcome.

A proposition to cut the number of items to 10, while expanding the coverage of the task in terms of difficulty, is probably contrary to the premises of the original Dutch test. Already the pre-piloted version was a compromise, comprising only a portion of the original items. However, the need to shorten the test while ensuring that each task measures its respective cognitive dimension along the full range of variation in students' cognitive competence has been given priority at this point of the project. Besides, as none of the pairs of items forms a (statistically) coherent unit, there seems to be no overriding reason to think of the task and its items as measuring the separate latent traits referred to by the names of the pairs of items; rather, it can be seen to measure as a whole something that could be called practical or applied reasoning, well justifying its place within the conceptual frame of learning to learn.
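The reliability statistics cited here (Cronbach's alpha and corrected item-total correlations) can be computed in a few lines; a minimal sketch, with `scores` standing for a hypothetical (n_students x n_items) matrix of 0/1 item scores:

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for an (n_students, n_items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_var = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_var / total_var)

    def item_total_corr(scores):
        """Corrected item-total correlations: each item against the
        sum of the remaining items."""
        scores = np.asarray(scores, dtype=float)
        total = scores.sum(axis=1)
        return np.array([np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
                         for j in range(scores.shape[1])])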
LAGSEV

Among the cognitive tasks, the LagSev task, measuring students' ability to step a short distance beyond the basic rules of arithmetic and apply them to arithmetic tasks that are easy in themselves, worked exceptionally well, offering challenges for students of all levels of attainment. Among the items, one produced results contrary to expectations in three countries (3E), two items did so in two countries (3C and 3D), one in one country (3A) and one in none (3B). However, unlike in the CCST, in only one item did the DIF index pass the limit of .75, due to the relative easiness of item 3D for the French students.


There were, however, considerable gender-related differences in the relative ease or difficulty of the items, with items 3A, 3B and 3E in particular, all at the mid to mid-high range of difficulty, proving clearly more difficult for girls than for boys (DIF 6.1, 5.6 and 2.8, respectively). This either reflected or was the reason for these three items also showing the highest DIF indices for the whole sample, slightly exceeding even those for gender. However, unlike in the CCST task, there is good reason to believe that these differences reflect real differences in the kind of reasoning required in the items rather than cultural or other bias. After all, the reference point for the analyses is the interaction of items of differing difficulty with students' total score in a test in which few of the tasks demand the same level of rigour of logical thinking. The relative difficulty of some of the items is most probably due to the novel situation of students having to decide by themselves how to apply one of the basic tenets of arithmetic, the order in which to execute the different arithmetic operations, without clear rules to follow: Am I expected to postulate brackets here or not? However, the two items where the correct answer depended on this decision were solved very uniformly by students who performed well in the test as a whole. (The mean performance in the whole test of the 625 students (weighted data) who solved all six LagSev items correctly was 1.07 standard deviations above the mean, as compared to the 0.77 of the approximately equally many students solving all five items in the Meta task correctly.) Some modification would be warranted, however, to further balance the difficulty of the items at both the mid-difficult and the mid-easy levels. As the task is proposed to be lengthened to the same number of items as the other tasks (10), new items have been added to fill the aforementioned gaps, while the easiest item (3D) has been changed to a slightly more challenging one, to better cover the weaker end of attainment (Figure 7).
CHOCO

The Choco task clearly posed problems for many students. In the country reports, one of the reasons cited was the hard-to-read layout of the task. However, there is reason to suspect that at least part of the difficulty lies not only in the layout but in the task's increased level of abstraction compared to the original FILLS task, caused by the change from the sphere of cars and drivers to that of strains, fermentation and dehydration. But whatever the reasons for the relatively high item- and task-level attrition, the distribution of the Choco items in the Rasch model reveals an inbuilt problem in the structure of the task, hidden by, but also a reason for, its high reliability at both item and task level. Two of the items are clearly too difficult for all but the very few top thinkers, while the rest loosely form two clusters along the total scale of difficulty, divided according to the quality of the variables to be compared (same-same, different-different), not the level of reasoning the respective subtask is set to measure. (Oddly enough, it seems to be more difficult for students to understand that the effect of a variable cannot be deduced when it stays the same than when it is changed. But, as mentioned above, only 2 % of students understood that the latter holds only if just one of the variables is changed at a time.)

The basic problem in the task is that all five of the easier subtasks are solved using one and the same principle: if the variable stays the same, you cannot deduce its effect; if the variable has been changed, you can. That is, the level of thinking required is the same for all five easier subtasks, as they measure the same form of reasoning or rule testing, and differences in results are due either to the difference between the pre-assigned tasks and the self-made pairs (of which the former are more difficult, as they also offer the distracting choice of "maybe" and require the students to keep the whole set of variables in mind at the same time) or to the consistency with which students apply this one principle, undoubtedly an important component of thinking. (At task level, the mean is higher for the pre-assigned tasks, but this is due to the smaller number of items in them, i.e., fewer possibilities to choose an incorrect option.) The relative independence of the two item types is visible in the fairly high correlations between the tasks of the same type (r = .64-.69), compared to the lower correlations between the pre-assigned and the self-made pairs (r = .33-.40).

Except for the two difficult items in subtask 4C, there are no DIF items in the Choco scale with a DIF value above .65. Overall, however, a great majority of the Choco items show high DIF indices, even when the one difficult subtask is excepted, more often indicating weaker than expected performance. The result is reflected in (or reflects) the difference in the results of boys and girls, with girls profiting more from the (relatively) easier Choco items.

Despite the problems revealed by the analysis presented above, the Choco task can be argued to tap a crucial aspect of students' ability to apply rigorous thinking (and perseverance) in a task that is novel to them in both setting and context, and not easily decipherable (What does this measure?). Bearing in mind that the one difficult subtask succeeded in finding responsive minds in a few of the students, a new task with a more varied selection of subtasks might be construed to cover students' competences in abstract thinking more evenly. What is needed (besides a more user-friendly layout) is a reworking of the task to offer a more varied array of comparisons for rule or proposition testing, perhaps in a somewhat less abstract setting, offering the student the possibility to discover or learn (or not to learn) the crux of the concept of variable effect when advancing from item to item. Accordingly, even if the task itself has to be changed, one item of each type of Choco subtask has been left in Figure 7 to work as a proxy for items in the new Choco, with additional items marked to cover the difficulty gaps left by the present task. As it will be hard to keep the number of items in the task on a par with the more linearly built cognitive tasks if its basic structure is to stay as it is, the number of proposed items (15), representing five three-item subtasks, exceeds that of the two previous tasks.
TEXT

The task measuring reading comprehension was relatively hard for the students. This was to be expected, as the correct interpretation of the text (operationalised as the relative importance of the item statements) is referenced on educated young adults, while the task was now presented to still-developing 14-year-old readers. This has led to many of the items posing a fairly similar level of challenge to students at the mid-to-upper difficulty level, while offering little (or no) power to differentiate among the weaker students.


Only one of the Text items functioned poorly in more than two countries, and even in that item, students in two countries performed above and in three countries below their respective expected levels. Also, all of the 23 students solving 15 of the 16 items in the task correctly (the best attainment in the task) agreed with the official correct answer, which, accordingly, can be interpreted even in view of the present data as the more mature interpretation. In Spain, five of the Text items elicited a performance above the expected level and none below it, whereas in no other country did students' performance differ from the expected level in so many items, and when it did, it varied both above and below the expected level. To serve better as an indicator for this age group, new items would be needed to cover the lower half of attainment, while the number of items measuring the currently covered range of difficulty could easily be reduced. Likewise, students' complaint about the text being too long could be met by basing the task on a somewhat shorter text. Accordingly, as for the Choco task, no actual items can be recommended to stay in the test, but some are left in Figure 7 as proxies for new items, together with three additional items, to form a 10-item scale with two items for the category topic and four items each for the categories main information and trivia.
META

The metacognitive monitoring task has also been included in the Rasch analysis as part of the cognitive measurements in the learning to learn test. Three of the five items were fairly easy for the students, while the first two posed more difficulties, even if still staying below the norm level 0. There were very few high-DIF items in the Meta task, except in Portugal where, for some reason, three of the five items worked against expectation. However, as one of them was easier than expected and two more difficult, the result might be due either to translation or to differences in curriculum. As the Meta task is not conceptually part of the cognitive component, its items have been left out of the Figure 7 proxy model.


LAKES

The Lakes task turned out to be problematic in many ways. It was found hard to understand, to have a difficult layout and too much redundant text, and, by accident, in the Finnish booklets it even ended up laid out in a way that caused further inconvenience to the students. Some of the problems are due to the haste in finding a task to replace the problem solving task originally intended for the set, which would have required too much time to fit in the desired time frame. Due to the haste, two of the six items in the task (shortened from a longer one) were too ambiguous for only one of the three choices to be accepted as correct, leading to an unintendedly high percentage of correct answers by mere guessing. The other four items were of middle to high difficulty, so that the task as a whole covered the range of students' ability fairly unevenly. Due to the poor functioning of the task, there seems to be no reason to try to build a new task based on the Lakes task; instead, an entirely new problem solving task should be sought to address students' readiness to explore and choose information relevant for solving the presented problem in a complex setting, preferably using textual, numerical, and pictorial information. In Figure 7, a hypothetical distribution of the items of the new problem solving task is presented, based partially on the items of the Lakes task, to represent a better fit between the students and the task. As problem solving can be seen to measure simultaneously all the different competences covered by the other tasks, the five (presumably in all respects more demanding) items of the task are proposed to be on average somewhat more difficult than those of the other tasks.

3.1.4 Conclusions regarding the cognitive component

(In the conclusions, the Meta task, even if analysed above as part of the cognitive component, will not be discussed, as it does not form part of the cognitive component of the test; see subchapter 3.3 on metacognition.)

In the framework for the pre-pilot test, the cognitive component was operationalised as comprising four dimensions or constructs, described as identifying a proposition, using rules, testing rules and propositions, and using mental tools. Of these, the last was operationalised to be measured by just one task on problem solving, while the others were covered by one independent task each, together with additional items taken from the Dutch test for cross-curricular skills, CCST. Both reliability analyses and structural equation modelling confirmed, however, that the adding of the CCST items to the three constructs was not well founded. Instead, the analyses supported an understanding of the CCST task as a fifth dimension in the cognitive component, measuring something that could be described as applying reasoning in age-appropriate practical problems.

In the analyses above, results have been reported for each item, task and dimension of the cognitive component, using varied statistical methods. Some of the tasks have turned out to work fairly well for the studied age group of students in the eight participating European countries, while some have shown apparent shortcomings in this respect. There have been clear differences between the tasks in the range of difficulty they measure, with only one task, LagSev (measuring construct B, using rules), covering the whole range of the students' competence fairly evenly, even if perhaps too sparsely, due to the mere six items in it. The CCST task revealed itself to be too easy to differentiate across the whole student body, while the opposite was true of two other tasks, Chocolate and Lakes, which were clearly too difficult for the majority of students in all countries. Accordingly, it can be said that the analyses of the compiled data allow for the following recommendations for the next step of the learning to learn indicator project:

1. Retaining the basic framework for the cognitive component, with the exception of removing the CCST items from their current constructs to form a new construct that could be called reasoning in everyday situations.

2. Revising the different tasks on the basis of the empirical data provided by the pre-pilot. For this, the following steps are recommended for the different tasks (presented here, like in the analyses above, in the order of their appearance in the two test booklets). As an all-embracing principle, the number of items in the different tasks is recommended to be made more equal, to facilitate the balancing of the instrument between the five cognitive dimensions it measures.

3. In the CCST task, several recommendations are made, due to the low reliability of the task, its relative easiness for the students in question (its lacking discrimination among the more able students), and students' comments concerning the layout of the task. The key among these are:
a) Reducing the number of items to form a more coherent and unidimensional task with just 10 items.
b) Adding new items to cover the more difficult half of the attainment distribution.
c) Dispensing with the introductory stories to the items, which students found confusing.

4. In the LagSev task, it is recommended to increase the number of items to 10 to cover the full distribution of student attainment more evenly, and to facilitate comparisons with the competences measured by the other constructs.

5. The Choco task is recommended to undergo a fundamental modification. The measuring of students' thinking with a task that can be anchored to the Piagetian understanding of formal operational thinking is seen as an asset. However, the present task shows many failings even in this respect, discussed more closely in the previous analyses. To fully profit from the idea behind the task, at least the following should be taken into account when building a new task to substitute for the present one:
a) Revising the item structure of the task to cover the cognitive object of the task, i.e., students' understanding of the role and effect of variable change in comparisons, more evenly.
b) Resetting the task in a context more accessible and familiar to students, as compared to the difficult terms embedded in the present task, unnecessary from the point of view of the thinking required.
c) Renewing the layout of the task so that the features central to the cognitive processes to be measured are clearer and more understandable to students.

d) Ensuring that the revised task covers the range of student attainment more evenly, offering items at all levels of difficulty.

6. The Text task, measuring students' skills in evaluative reading, proved to lack items easy enough to discriminate between students at and under the mid-level of attainment. The text that the items were based on was also seen by many students as simply too long. Accordingly, the following are recommended to improve the task:
a) Transferring the basic structure of the task to a somewhat shorter explorative text.
b) Reducing the number of items to cohere with those of the other tasks, while allowing for the variation of the three levels of propositions (for example, 4+4+2 instead of the current 8+6+2).

7. The task for measuring problem solving, Lakes, turned out to be problematic in several respects. The task generated very uneven results, probably as much due to a mistake in the provided information as to the layout of the task, which many students found hard to read. As the objective of the cognitive component of the test is not, however, to measure students' ability to solve familiar school-like tasks but to measure their readiness to tackle and solve cognitive challenges that differ from the traditional ones, the argument of oddity need not be accepted as the decisive one to sink a task. Unfortunately, in the Lakes task the main problems lie in the task itself. Accordingly, it is recommended that a totally new problem solving task be chosen or tailored for the purpose of the indicator. The task should be built on the premise that, to solve the problems (items) in the task, students have to look for and choose information among material presented in the task in different formats, including text, graphs, and tables.

In Figure 7, a hypothetical IRT model is presented as a guideline for the next phase of the cognitive component of the test, based on the current IRT scaling of the cognitive items in Figure 6. The objective of the model is to show a hoped-for, more balanced distribution of the items of the different tasks along the present distribution of student proficiency.


The model is built combining four types of items:
a) Items in the present tasks that have worked sufficiently well in the pre-pilot (CCST and LagSev)
b) New items added to present tasks (CCST and LagSev)
c) New items for tasks revised on the basis of present ones (Choco and Text)
d) New items for a task to replace a present task (Lakes).

While maintaining the number of cognitive tasks at five, the number of items would be reduced from the current 65 items (excluding the Meta task) to 50. These would comprise:
- 10 items each in the CCST, LagSev, and Text tasks, to allow for better coverage of the whole distribution of student proficiency and to better suit comparisons between the different tasks;
- 15 items in the new task to replace Choco, to allow for a task structure equal to the current one, but covering the total difficulty range of the test more evenly;
- 5 items in the new problem solving task to replace the current Lakes task, allowed to be somewhat biased toward the more difficult end, to reflect the character of the task as combining the different cognitive subdomains measured by the other tasks.


[Figure 7 Proposed item difficulty distributions for the cognitive tasks in a new pilot test, based on the cognitive tasks and items of the pre-pilot. Item-person map on the logit scale, with retained and proposed items colour-coded by task: CCST + c1-c4, LAG + g1-g5, CHOCO + n1-n5, TEXT + t1-t4, p1-p5 for LAKES. One X stands for 7.3 students.]


3.2 The affective domain


Results of the affective component of the test will be presented both according to the original data, with its 2325 students, and by applying analysis-appropriate means to counterbalance the effect of the differing sample sizes on the results. Due to the relatively low item- and scale-level attrition, no replacement of missing values has been done in the affective domain (2255 of the 2306 students who answered at least one item in one scale responded to items in all scales). In the first subchapter, the affective component will be looked at according to the original and/or slightly revised scales. (In the analyses, some of the original ELLI scales have been divided into two separate scales, based on careful reading of the items and the lacking reliability of the original scales.) After that, the results will be analysed and discussed according to the conceptual structure of the three constructs of learning motivation, learning strategies and orientation toward change (A), academic self-concept and self-esteem (B), and learning environment (C). In the third subchapter, the individual items of the affective scales will be analysed using Rasch modelling. Lastly, some conclusions and recommendations for the next phase will be made concerning the affective component of the indicator.

3.2.1 Results according to the original scales


The affective domain comprises original scales of clearly two sorts: the internally more versatile British Learning Power Scales (ELLI), with 3 to 8 items each, and the theoretically bound and contentually more rigorous Finnish scales, with just two items each. However, the decision to cut the FILLS scales to just two items each, made in the eagerness to cut testing time, was clearly a mistake, and a return to (at least) the original three items per scale is recommended. Due to the different number of items and the nature of the scales this reflects, analysing the functioning of the two types of scales and comparing their reliabilities is a somewhat disparate enterprise. While each FILLS scale either has sufficient reliability or does not, in many ELLI scales neither a critical reading of the items nor the reliability of the scale supports their inner consistency. Hence, in the scale-level results and analyses presented below, some of the ELLI scales have been revised by leaving out items with a very low (or even negative) item-total correlation, while some scales have been divided into two subscales to achieve scales with better inner coherence.

The viability of the original scales was first examined through principal component analysis, directed separately at the items in the two booklets. (In connection with the later analyses concerning the conceptual L2L affective domains, however, it has become apparent that some of the ELLI scales in Booklet 2 should have been analysed together with the items of Booklet 1.) For Booklet 1, the analysis yielded 11 factors explaining 51 % of the total variance. More than half of the items loaded primarily on the first factor, which alone explained 19 % of the total variance and 8 % of the variance in students' school achievement. As the analysis offers little support for even a relative independence of more than a few of the FILLS scales, another factor analysis was executed (maximum likelihood with Varimax rotation, yielding 11 factors explaining 38 % of the total variance). In this model, 15 of the 49 items loaded primarily on the first factor, explaining 8 % of the total variance and just 3 % of the variance in GPA. Ten of the items loading primarily on the first factor were from different subscales in ELLI, together with two full scales and one lone item from the FILLS. However, the rotated solution yielded several of the FILLS scales in fairly pure form, with correlations between the factors and GPA ranging from .00 to .33.

The results were very similar for the items in Booklet 2. Principal component analysis yielded 12 factors explaining 59 % of the variance, with 19 of the 40 items loading primarily on the first factor. Only some of the more precise FILLS scales on academic self-concept and self-esteem loaded primarily on factors of their own. A rotated factor solution (maximum likelihood with Varimax rotation; 12 factors explaining 44 % of the variance and just 1 % of the variance in GPA) yielded most of the original scales in a pure form, excepting the first factor, which combined the two group work scales together with the items of the learning with friends scale from ELLI.

The rotated solutions, with some repeated step-wise factorings, were accepted as the basis for seeing to what extent the original scales would be supported as independent constructs. Most of the ELLI scales did not separate themselves in their original form despite the repeated factorings, which led, through careful (re)reading of the items and subsequent reliability analyses, to their restructuring by discarding a few items with exceptionally poor item-total correlations and by dividing some of the scales into two contentually more coherent scales. This increased the number of scales based on the ELLI items to eleven, compared to the original seven learning power scales. However, even with these manoeuvres, many of the scales continued to have low reliabilities. The reliabilities and means of the affective scales for both the original and the weighted data are presented in Table 18. The items in the different scales, together with the basic reliability data (item-total correlations), are presented in Appendix 1.
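The kind of exploratory factoring described here can be sketched in a few lines; the data below are random placeholders standing in for the (n_students x n_items) matrix of Likert-scale responses, and the 11 factors with Varimax rotation mirror the Booklet 1 solution:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Placeholder responses; in the actual analyses these would be the
    # students' answers to the 49 Booklet 1 affective items.
    rng = np.random.default_rng(0)
    responses = rng.integers(1, 6, size=(500, 49)).astype(float)

    fa = FactorAnalysis(n_components=11, rotation="varimax")
    fa.fit(responses)

    loadings = fa.components_.T            # (n_items, n_factors)
    primary = np.abs(loadings).argmax(1)   # primary factor of each item
    print(np.bincount(primary, minlength=11))  # how many items per factor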


Table 18 Basic descriptives for the affective scales, original and weighted data (α = Cronbach's alpha)

Scale | α orig | Mean orig | SD orig | α wtd | Mean wtd | SD wtd
1A Changing and learning | .71 | 3.64 | 0.80 | .71 | 3.68 | 0.81
2A Critical curiosity | .61 | 2.92 | 0.88 | .61 | 3.00 | 0.89
3A Meaning making - assimilation | .66 | 3.50 | 0.78 | .64 | 3.54 | 0.78
3B Meaning making - sense in own life | .57 | 4.12 | 0.74 | .53 | 4.17 | 0.73
4A Creativity - trusting imagination | .61 | 3.71 | 0.74 | .60 | 3.77 | 0.74
4B Creativity - trying new things | .72 | 3.38 | 0.85 | .71 | 3.42 | 0.86
5A Learning relationships - friends | .59 | 3.67 | 0.81 | .58 | 3.72 | 0.81
5B Learning relationships - family | .65 | 3.66 | 1.02 | .64 | 3.73 | 1.00
6A Strategic awareness - getting around problems | .61 | 3.14 | 0.75 | .60 | 3.19 | 0.75
7A Fragility and dependence - helplessness | .54 | 2.61 | 0.74 | .52 | 2.58 | 0.74
7B Fragility and dependence - mild helplessness | .47 | 3.15 | 0.77 | .47 | 3.15 | 0.79
8 Goal-orientation - mastery intrinsic | .67 | 3.89 | 0.81 | .66 | 3.96 | 0.80
9 Goal-orientation - mastery extrinsic | .76 | 4.12 | 0.84 | .74 | 4.16 | 0.83
10 Goal-orientation - performance approach | .46 | 3.22 | 0.95 | .43 | 3.22 | 0.96
11 Goal-orientation - performance avoidance | .49 | 3.41 | 0.94 | .45 | 3.45 | 0.93
12 Goal-orientation - avoidance | .60 | 3.02 | 1.04 | .59 | 2.94 | 1.05
13 Control motivation | .63 | 3.95 | 0.89 | .62 | 4.03 | 0.88
14 Control expectancy | .68 | 4.23 | 0.77 | .66 | 4.31 | 0.74
15 Means-Ends - effort | .55 | 4.25 | 0.73 | .51 | 4.29 | 0.72
16 Means-Ends - ability | .49 | 2.16 | 0.93 | .49 | 2.09 | 0.93
17 Means-Ends - chance | .74 | 1.89 | 0.95 | .75 | 1.86 | 0.95
18 Agency - effort | .63 | 3.42 | 0.91 | .64 | 3.48 | 0.92
19 Agency - ability | .73 | 4.01 | 0.84 | .68 | 4.08 | 0.81
20 Academic withdrawal | .62 | 2.80 | 1.01 | .60 | 2.76 | 1.01
21 Deep processing | .49 | 3.44 | 0.96 | .47 | 3.54 | 0.95
22 Self-concept - thinking | .52 | 3.66 | 0.79 | .49 | 3.70 | 0.78
23 Self-concept - math | .83 | 3.15 | 1.18 | .80 | 3.16 | 1.16
24 Self-concept - reading | .76 | 3.91 | 1.02 | .69 | 3.87 | 1.00
25 Self-esteem | .65 | 3.89 | 0.90 | .64 | 3.92 | 0.89
26 Learning environment - school | .78 | 3.65 | 0.95 | .78 | 3.72 | 0.96
27 Learning environment - teachers | .71 | 3.02 | 1.03 | .69 | 3.07 | 1.04
28 Perceived support - parents | .64 | 4.19 | 0.74 | .63 | 4.23 | 0.74
29 Perceived support - peers | .75 | 2.95 | 0.91 | .74 | 3.00 | 0.91
30 Group work - task orientation | .61 | 3.76 | 0.81 | .59 | 3.80 | 0.81
31 Group work - cooperation | .65 | 4.04 | 0.75 | .65 | 4.09 | 0.74

In Figure 8, the scales are arranged in order of reliability (weighted data) for an overview of the differences between the scales in terms of their inner consistency. As can be seen, the reliability of most of the scales is not very high, with just seven scales presenting reliabilities above α = .70 (Cronbach's alpha) and twelve falling below the level α = .60. The low reliabilities of the scales will be discussed further after an overview of group differences in scale means and reliabilities.

[Figure 8 Affective scales in order of reliability (Cronbach's alpha): bar chart of the weighted-data alphas reported in Table 18, headed by Self-concept - math and Learning environment - school]


Group differences

Many of the scales show gender differences both in scale means and in reliability. As to the level of students' affective attitudes, gender differences are most notable in avoidance orientation, mild helplessness and cooperative group work (F > 100), but highly significant also in the scales for learning relationships: friends, deep processing, task orientation in group work, and math self-concept (F > 50). Overall, the gender difference exceeds the level F = 20 (ANOVA) in roughly half of the scales, mainly in accordance with earlier research. Boys expressed more confidence both in their own abilities and in the role of ability in explaining achievement but, maybe reflecting the latter, also a stronger tendency to avoid school-related effort. Girls expressed greater readiness to orient their work and effort to activities related to school and achievement, including cooperation with their friends, while simultaneously showing a greater tendency to succumb to (mild) helplessness (Table 19).

In fifteen of the 35 scales, the gender difference in the reliability of the scale was .05 or bigger. Some of the scales were more reliable for boys, some for girls. As several of the scales with a gender difference were not very reliable over the whole sample, the difference, as well as a possible difference in scale mean, might be as contingent as the scale itself. In some of the scales which were relatively reliable over the whole sample, the difference is big enough to warrant a recommendation to look more closely at the individual items in the scales (e.g., the reliability of the scale goal orientation: mastery-intrinsic is α = .70 for boys but only α = .63 for girls). However, as the decision to cut the FILLS affective scales to two items was clearly a mistake in the first place, some of the problems might be cleared up by just returning to the original three-item scales, or even by extending the number of items in every scale to four.
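The gender comparisons reported here are one-way ANOVAs of scale means. A minimal sketch with simulated data (the group sizes are hypothetical; the means and SDs echo the avoidance-orientation row of Table 19):

    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(1)
    boys = rng.normal(3.10, 1.03, size=1100)   # simulated scale scores
    girls = rng.normal(2.78, 1.04, size=1200)

    F, p = f_oneway(boys, girls)
    print(round(F, 2), p < .001)  # a mean gap of this size yields a large F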


Table 19 Affective scales, mean values for boys and girls (in the original, scales with F > 20 printed in red)

Scale | Boys Mean | Boys SD | Girls Mean | Girls SD
1A Changing and learning | 3.68 | 0.83 | 3.68 | 0.79
2A Critical curiosity | 3.06 | 0.89 | 2.94 | 0.88
3A Meaning making - assimilation | 3.50 | 0.79 | 3.57 | 0.76
3B Meaning making - sense in own life | 4.10 | 0.76 | 4.22 | 0.69
4A Creativity - trusting imagination | 3.72 | 0.75 | 3.80 | 0.73
4B Creativity - trying new things | 3.37 | 0.88 | 3.47 | 0.83
5A Learning relationships - friends | 3.60 | 0.82 | 3.82 | 0.78
5B Learning relationships - family | 3.67 | 0.99 | 3.79 | 1.00
6A Strategic awareness - getting around problems | 3.16 | 0.77 | 3.22 | 0.73
7A Fragility and dependence - helplessness | 2.56 | 0.74 | 2.59 | 0.74
7B Fragility and dependence - mild helplessness | 3.02 | 0.79 | 3.26 | 0.76
8 Goal-orientation - mastery intrinsic | 3.93 | 0.85 | 3.99 | 0.75
9 Goal-orientation - mastery extrinsic | 4.09 | 0.90 | 4.21 | 0.76
10 Goal-orientation - performance approach | 3.26 | 0.98 | 3.18 | 0.95
11 Goal-orientation - performance avoidance | 3.43 | 0.95 | 3.46 | 0.91
12 Goal-orientation - avoidance | 3.10 | 1.03 | 2.78 | 1.04
13 Control motivation | 3.95 | 0.90 | 4.09 | 0.84
14 Control expectancy | 4.27 | 0.78 | 4.34 | 0.71
15 Means-Ends - effort | 4.27 | 0.74 | 4.30 | 0.69
16 Means-Ends - ability | 2.17 | 0.95 | 2.02 | 0.91
17 Means-Ends - chance | 1.91 | 0.99 | 1.81 | 0.91
18 Agency - effort | 3.39 | 0.93 | 3.56 | 0.90
19 Agency - ability | 4.14 | 0.81 | 4.02 | 0.81
20 Academic withdrawal | 2.75 | 1.01 | 2.77 | 1.00
21 Deep processing | 3.40 | 0.98 | 3.66 | 0.90
22 Self-concept - thinking | 3.76 | 0.76 | 3.65 | 0.80
23 Self-concept - math | 3.29 | 1.13 | 3.04 | 1.17
24 Self-concept - reading | 3.79 | 1.00 | 3.93 | 0.99
25 Self-esteem | 4.01 | 0.86 | 3.83 | 0.91
26 Learning environment - school | 3.65 | 0.99 | 3.78 | 0.92
27 Learning environment - teachers | 3.05 | 1.04 | 3.10 | 1.03
28 Perceived support - parents | 4.19 | 0.76 | 4.25 | 0.72
29 Perceived support - peers | 2.93 | 0.93 | 3.06 | 0.89
30 Group work - task orientation | 3.70 | 0.83 | 3.89 | 0.78
31 Group work - cooperation | 3.98 | 0.77 | 4.20 | 0.69


Not only did boys and girls differ in their responses to the affective items, but there were also statistically significant (p < .001), even if less acute, between-country differences in the means of all scales. The differences were most prominent in students' expressed views concerning the importance and interest of things learnt at school, the role of chance in (school) success, critical curiosity, mastery-intrinsic goal orientation, and deep processing (all F > 50). There were also considerable between-country differences in scale reliability, suggesting a need for further research on item choice and wording, including the application of more systematic back-translation, to secure an acceptable level of consistency in the working of the different dimensions.

Also parents' education, regional factors, and first language spoken were shown to have an impact on students' views. Over the whole sample, there was not much difference in the magnitude of the impact of mothers' and fathers' education on their child's affective attitudes, except for a slightly greater impact of mothers' education on students' belief in their own ability, but also on their tendency for academic withdrawal. Between students from urban and rural schools, only the difference in views related to working with friends was statistically significant above the level F = 20, with rural students expressing a greater willingness to work with and share their learning activities with friends. However, as girls were slightly overrepresented in the rural sample (57 % rural vs. 52 % total), the difference might be an artefact or, at least, be smaller with a more balanced sample. There were somewhat bigger differences along the other regional variable, even if these, too, might just reflect biases in the sample, with the Italian and Portuguese students in the small coastal group showing a tendency for a more extreme use of the Likert scale. The same seems to be true of students with a first language other than the language of instruction. (The interpretation is based on the two groups seeming to have a tendency to use the extreme values of the Likert scale not only in the socially desirable looking items but over the whole set of items.)

Students' self-concept as readers also varied somewhat by age (the older the eighth graders were, the less ready they were to see themselves as good readers), while the younger students also agreed more strongly with their parents valuing education and with the role of chance in explaining success. However, as the mean difference in age varied by country, these slight differences between the four age quartiles, too, might be just an artefact due to sampling.

The affective component vs. students' test attainment and GPA

From the point of view of the (alleged) role of the affective component in learning to learn, the most interesting and important differences might be those related to students' school achievement and to their attainment in the cognitive part of the test. This does not mean that measuring learning to learn would just aim at predicting students' current or even future school achievement. However, in the pre-pilot, students' (past) school achievement was the only reference available to act as a proxy for the objective of measuring students' readiness to use their cognitive capacities to meet and solve the different learning tasks that come their way in life, be it at school, at work, or in leisure time. Regarding both the cognitive component of the test and the GPA (operationalised in the pre-pilot as national quartiles), some of the affective scales proved fairly powerful in differentiating between students, while some clearly measure affective factors which are fairly independent of either (Table 20). One interpretation of these differences or, to give a more exact name to the phenomenon, of the correlations between the measured affective factors and students' test results or their GPA, could be as a profile of a good thinker/learner (test attainment) or a good student (GPA). The difference between the two could be interpreted as the added value that the affective attitude measured by the scale gives to a student performing in the test at a given level. Exceptions to this interpretation are the scales related to students' academic self-concept, as these can be understood to rather reflect earlier success in school, even if they might also play a role in lowering the threshold for students to accept new and more demanding learning challenges.

By default, this interpretation is based on the assumption that students tried their best in the test, which, of course, might not be the case, and might differ by country, gender, and attainment level. However, as the objective of the whole endeavour was to measure students' willingness to tackle cognitive challenges put to them, there seems to be no choice but to accept the results as what the students would do even in a real-life situation. After all, the demands set for them in and by school are real for this age group.


Table 20 Significance of difference (ANOVA) in students' mean levels on the different affective scales for TEST attainment and GPA quartiles

Scale | TEST F | TEST Sig. | GPA F | GPA Sig.
1A Changing and learning | 9.05 | 0.00 | 59.90 | 0.00
2A Critical curiosity | 8.54 | 0.00 | 32.28 | 0.00
3A Meaning making - assimilation | 27.62 | 0.00 | 33.31 | 0.00
3B Meaning making - sense in own life | 26.75 | 0.00 | 32.34 | 0.00
4A Creativity - trusting imagination | 11.99 | 0.00 | 12.44 | 0.00
4B Creativity - trying new things | - | ns | - | ns
5A Learning relationships - friends | - | ns | 5.24 | 0.00
5B Learning relationships - family | - | ns | 2.54 | 0.05
6A Strategic awareness - getting around problems | 5.03 | 0.00 | 44.52 | 0.00
7A Fragility and dependence - helplessness | 132.45 | 0.00 | 171.04 | 0.00
7B Fragility and dependence - mild helplessness | 17.24 | 0.00 | 32.54 | 0.00
8 Goal-orientation - mastery intrinsic | 19.00 | 0.00 | 23.28 | 0.00
9 Goal-orientation - mastery extrinsic | 14.71 | 0.00 | 78.12 | 0.00
10 Goal-orientation - performance approach | - | ns | 8.97 | 0.00
11 Goal-orientation - performance avoidance | - | ns | 3.48 | 0.05
12 Goal-orientation - avoidance | 12.33 | 0.00 | 25.26 | 0.00
13 Control motivation | 27.49 | 0.00 | 66.92 | 0.00
14 Control expectancy | 34.73 | 0.00 | 59.69 | 0.00
15 Means-Ends - effort | 9.28 | 0.00 | 4.08 | 0.01
16 Means-Ends - ability | 38.62 | 0.00 | 11.34 | 0.00
17 Means-Ends - chance | 108.42 | 0.00 | 34.66 | 0.00
18 Agency - effort | 6.00 | 0.00 | 96.88 | 0.00
19 Agency - ability | 94.48 | 0.00 | 159.58 | 0.00
20 Academic withdrawal | 71.44 | 0.00 | 146.65 | 0.00
21 Deep processing | 11.59 | 0.00 | 18.03 | 0.00
22 Self-concept - thinking | 21.75 | 0.00 | 98.42 | 0.00
23 Self-concept - math | 73.26 | 0.00 | 246.85 | 0.00
24 Self-concept - reading | 66.74 | 0.00 | 63.06 | 0.00
25 Self-esteem | - | ns | 5.61 | 0.00
26 Learning environment - school | 12.24 | 0.00 | 12.96 | 0.00
27 Learning environment - teachers | 5.04 | 0.00 | 8.00 | 0.00
28 Perceived support - parents | 28.60 | 0.00 | 23.76 | 0.00
29 Perceived support - peers | - | ns | 8.22 | 0.00
30 Group work - task orientation | 40.68 | 0.00 | 45.77 | 0.00
31 Group work - cooperation | 16.03 | 0.00 | 17.66 | 0.00


To make the intertwined relation of the two visible, the correlations of the affective scales with test attainment and with GPA are shown in Figure 9, arranged according to their correlation with GPA, i.e., offering an affective profile of a good student.

[Figure 9 Correlations between the affective scales and students' test attainment and GPA quartiles; scales arranged in order of their correlation with GPA (correlation axis from -0.60 to 0.60)]


Among the different dimensions measured in the learning to learn test, students' performance in the cognitive tasks is the best predictor of their school attainment. The next best predictors, or more probably reflectors, are students' ability-related self-concepts. Among the scales which are related not so much to feedback as to students' own future-oriented action, the most strongly correlated with school achievement are agency: effort (readiness to apply one's own effort to school work), mastery-extrinsic goal orientation (personal importance of doing well) and control motivation (willingness for self-regulated learning) (r = .24, r = .22 and r = .20, respectively). These can be seen to measure the attitudes most directly contributing to new learning, regardless of students' earlier level of attainment.

However, the affective component did not comprise only factors favourable to new learning. This means that the scales at the negative end of the correlation distribution should be looked at with equal attention. And, echoing the results of earlier assessments combining diverse affective factors with school attainment, some scales at the negative end turn out to be even better predictors of, or reflectors of, poor achievement than the most influential positive attitudes are of good achievement. Among these detrimental or learning-impeding attitudinal factors, the most salient are fragility and dependence: helplessness and academic withdrawal (r = -.32 and r = -.29, respectively).

The difference between the curves for the correlations of the different affective scales with test attainment and GPA, respectively, can be seen as an approximation of the added value each measured dimension gives to achievement vis-à-vis students' cognitive competence. However, there are differences in the role of the different factors related to gender and to level of attainment (test or school), as well as differences (in all of these) between the countries. Some of these might be related to differences in the functioning of the scales and in the mean levels of students' (expressed) attitudes, some to differences in what is appreciated and rewarded at school in the different countries.
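Computationally, the profile in Figure 9 is just two vectors of Pearson correlations sorted by one of them; a minimal sketch, with `scales`, `test_score` and `gpa` standing for hypothetical arrays of scale scores and attainment measures:

    import numpy as np

    def good_student_profile(scales, test_score, gpa):
        """Correlate each affective scale (columns of `scales`) with test
        attainment and GPA, sorted by the GPA correlation as in Figure 9."""
        r_test = np.array([np.corrcoef(s, test_score)[0, 1] for s in scales.T])
        r_gpa = np.array([np.corrcoef(s, gpa)[0, 1] for s in scales.T])
        order = np.argsort(-r_gpa)       # strongest GPA correlate first
        return order, r_test[order], r_gpa[order]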


3.2.2 The L2L affective constructs

In the conceptual model for the European Union indicator project, the affective domain has been seen to comprise three relatively independent constructs: A, learning motivation, learning strategies and orientation toward change; B, academic self-concept and self-esteem; and C, learning environment. In this subchapter, the results of the pre-pilot will be cross-examined in light of this model, meaning that the viability of the model itself will also be discussed, based on the empirical evidence of the compiled eight-country data. Two of the three affective constructs, learning motivation, learning strategies and orientation toward change (A) and learning environment (C), seemed at the outset to form fairly reliable scales, whereas the third, academic self-concept and self-esteem (B), proved not to do so (Table 21).
Table 21 Reliabilities of the three affective L2L domain scales at item level

Construct | Cronbach's α
A Learning motivation, learning strategies, and orientation toward change | .86
B Academic self-concept and self-esteem | .67
C Learning environment | .81

However, inside all three constructs there are items and subscales that do not fit well in their respective constructs, some even presenting negative item-total correlations. In domain A, this is especially true of all the scales measuring attitudes detrimental to learning and task-acceptance, even if many other items, too, present problems from the point of view of the inner coherence of the construct. The problem with the negative items points to a need to study the items measuring learning-impeding beliefs and attitudes as a dimension of its own, (semi-)independent of the learning-supporting dimension. Also, there seems to be little reason to have included the fragility scale of ELLI (now in two separate scales, based on the original), or the scale for creativity, in domain C, learning environment; rather, they should be seen as an integral part of the positive (creativity: trying new things) and negative (fragility: helplessness) parts of scale A. Likewise, as some of the scales included in the original construct A were clearly based on students' valuation of their abilities (agency: ability and control expectancy), they have been moved to the self-concept related construct B.

As a result, a new scale comprising 10 of the original scales, with a total of 22 items, has been constructed for the positive dimension of domain A, learning motivation, learning strategies and orientation toward change (Table 22). The scale is highly reliable (Cronbach's alpha .88), with item-total correlations between .38 and .64. (The .38 correlation has been accepted in order to keep the original scales intact or, in the case of the originally longer ELLI scales, to comprise at least two items each, reflecting the shorter FILLS scales.) The items measuring attitudes and beliefs detrimental to learning and success, instead, while posting an acceptably high Cronbach's alpha over all the items (.72), do not present a very coherent scale, with nearly all item-total correlations below r = .40, largely reflecting the poor reliability of the original scales (Table 23). Yet, as the detrimental dimension seems to play at least as strong a role in students' attainment as their learning-supportive attitudes, it would be desirable to find a more coherent set of items for measuring it, to give it a weight comparable to the learning-supporting dimension A.

Table 22 Revised L2L Affective construct A, Learning motivation, learning strategies and orientation toward change / supportive. Scales in order of the strongest item-total correlation of the items in the scale

A1 LEARNING MOTIVATION, LEARNING STRATEGIES AND ORIENTATION TOWARD CHANGE / POSITIVE DIMENSION

Goal orientation - mastery-intrinsic (α = .66)
1 An important goal at school is to acquire new knowledge. (item-total .51)
23 To learn as much as possible is an important goal for me at school. (.64)
Agency beliefs - effort (α = .64)
4 I put sufficient effort to my school work. (.46)
36 I work hard to succeed at school. (.56)
Changing and learning (α = .71)
12 I'm continually improving as a learner. (.51)
46 I can feel myself improving as a learner. (.54)
Meaning making - assimilation (α = .64)
3 I make connections between new things and things I already know. (.48)
35 I make connections between what I am learning and what I've learned before. (.52)
Strategic awareness (α = .58)
19 When I find learning boring I can usually make it interesting. (.52)
24 I have ways of making myself learn if I don't feel like learning. (.46)
45 If I get stuck with a task I can usually get round the problem. (.44)
Critical curiosity (α = .57)
2 When learning is hard, I tend to find it interesting. (.52)
26 I like it when I have to try really hard to understand something. (.42)
Control motivation (α = .62)
18 If I get a problem wrong, I want to know where I made the mistake. (.50)
22 When I get a problem wrong, I want to know where I made the mistake. (.38)
Creativity - trying new things (α = .70)
37 I like to try out new learning in different ways. (.48)
39 I feel it's OK to try different things out in my learning. (.43)
Goal orientation - mastery-extrinsic (α = .74)
42 Getting good marks at school is important for me. (.40)
49 For me, an important goal is to do well at school. (.48)
Deep processing (α = .47)
7 When preparing for exams, I stop to turn things over in my mind. (.42)
39 When I study, I make up questions for myself to see I've really understood. (.47)


Table 23 Revised L2L Affective construct A Learning motivation, learning strategies and orientation toward change / detrimental. Scales in order of the strongest item-total correlation of the items in the scale (scale Cronbach's alpha in parentheses; item-total correlation after each item)

A2 LEARNING MOTIVATION, LEARNING STRATEGIES AND ORIENTATION TOWARD CHANGE / NEGATIVE DIMENSION

Academic withdrawal (α=.60)
28 R I give up easily if my assignments look too demanding. (.45)
30 R Concentrating on difficult tasks is hard for me. (.44)
Fragility and dependence / helplessness (α=.52)
20 R When learning is hard, it's because I didn't have enough help. (.28)
24 R I avoid trying to learn new things because I don't like feeling confused & uncertain. (.36)
26 R When I have to struggle to learn it's probably because I'm not very clever. (.38)
32 R Sometimes I don't know what I am to do until I see my friends getting on with it. (.37)
Means-ends-beliefs / chance (α=.75)
15 R Failure at school is mainly due to bad luck. (.30)
27 R Success at school is a matter of luck. (.37)
Fragility and dependence / mild helplessness (α=.47)
2  R When I have trouble learning something I tend to get upset. (.20)
4  R When I'm stuck I don't usually know what to do about it. (.34)
12 R When not able to master something, it's because I don't know how to go about it. (.22)
Goal orientation - avoidance (α=.59)
9  R I have no interest in doing anything extra for school. (.26)
20 R I only do the compulsory work for school, nothing more. (.29)
Means-ends-beliefs / ability (α=.49)
11 R If one fails at school it just shows that one is not smart enough. (.24)
40 R Poor marks are due to lack of ability. (.29)
Goal orientation - performance avoidance (α=.45)
25 R I try to avoid situations in which I might fail or make a mistake. (.13)
48 R In class I try to avoid situations where the others might think I am dumb. (.21)

Omitting the culturally somewhat ambivalent scale performance avoidance and the in itself less reliable scale fragility, a more coherent construct could be formed, even if it would still lack full inner coherence (Cronbach's alpha .69, lowest item-total correlation .24). After all, the basic problem is the weak reliability of the original scales measuring negative affects, pointing to a need for a careful rethinking of the subconstructs if the dimension of detrimental attitudes is to be included in the instrument.


In construct B Academic self-concept and self-esteem, the inclusion of general self-esteem alongside the clearly ability-related self-concepts causes a problem. This is no surprise, as earlier studies have shown self-concept to also have a reparative role as a buffer between students' perceptions of outside expectations and their level of experienced success in different spheres of life, leading to low or even negative correlations between the two in some subgroups and some spheres, especially the school. Correlations between students' self-concepts as readers and in mathematics are not very high either, causing low item-total correlations in the construct even after the omission of self-esteem. To achieve a relatively broad indicator of students' views concerning their own abilities and attainment, a revised construct B1 (Cronbach's alpha .74) is proposed, built on the three academic self-concept scales and the aforementioned two ability-related scales from the current domain A (Table 24). Additionally, a separate subconstruct for subjective well-being (B2), represented in the current instrument by just the two items of the scale self-esteem, has been included in the affective construct B.


Table 24 Revised L2L Affective construct B Academic self-concept. Scales in order of the strongest item-total correlation of the items in the scale (scale Cronbach's alpha in parentheses; item-total correlation after each item)

B ACADEMIC SELF-CONCEPT AND SELF-ESTEEM

B1 ACADEMIC SELF-CONCEPT (α=.74)
Agency / ability (α=.68)
21 I have necessary abilities to do well at school. (.51)
31 I am clever enough to do well at school. (.54)
Academic self-concept / thinking (α=.49)
5  I am smart and perceptive. (.52)
13 I am very imaginative; I often come up with good ideas. (.34)
Academic self-concept / mathematics (α=.80)
3  I usually handle even the more difficult maths problems well. (.42)
30 I am good in maths. (.38)
Control expectancy (α=.66)
8  I can get good marks at school if I want to. (.42)
34 I can do well at school if I decide to. (.37)
Academic self-concept / reader (α=.69)
19 Reading is really easy for me. (.29)
35 I am a good reader. (.32)

B2 SUBJECTIVE WELL-BEING / SELF-ESTEEM (α=.64)
Self-esteem (α=.64)
27 I have a very positive image of myself. (.47)
40 I like being me. (.47)

The items in the proposed domain C Learning environment form a highly reliable construct (Cronbach's alpha .81), but with many item-total correlations below .40. Accordingly, it is proposed to be understood as comprising three separate, internally more coherent subconstructs C1, C2 and C3 (Table 25), representing the three major contextual factors related to young people's life and learning: school, family, and friends or peers (Cronbach's alpha .77, .66, and .61, respectively). Unfortunately, none of the three scales is very stable in its present state either.


Table 25 Revised L2L Affective construct C Learning environment. Scales in order of the strongest item-total correlation of the items in the scale (scale Cronbach's alpha in parentheses; item-total correlation after each item)

C LEARNING ENVIRONMENT

C1 SCHOOL (α=.77)
Learning environments / school (α=.78)
11 I think we learn useful and important things at school. (.55)
18 I think we are taught a lot of interesting things at school. (.60)
Group work / cooperation orientation (α=.65)
6  When doing group work I am ready to help the others in the group. (.43)
15 When doing group work I listen to everyone's ideas. (.47)
Learning environments / teachers (α=.69)
21 I think our teachers really pay attention to pupils' own ideas. (.46)
33 I think our teachers are just and fair. (.43)
Group work / task orientation (α=.59)
9  In group work I am active in discussing ways things should be solved or done. (.42)
17 When doing group work I make suggestions as to how we should work. (.42)

C2 FAMILY (α=.66)
Learning relationships / family (α=.64)
38 I feel that my family is an important source of learning for me. (.50)
34 There is at least one person at home who is an important guide for me in my learning. (.44)
Perceived support / parents (α=.63)
14 My parents set a high value on trying to learn and understand things. (.44)
23 My parents set a high value on school and education. (.41)

C3 FRIENDS (α=.61)
Perceived support / peers (α=.74)
8  My friends are interested in the things taught in school. (.41)
16 My friends value the things we learn at school a lot. (.45)
Learning relationships / friends (α=.58)
28 Talking things through with my friends helps me to learn. (.38)
36 I enjoy discussing difficult problems with my friends. (.33)

Even if the affective constructs still have to be seen as only preliminary (i.e., based on a preliminary analysis of the fit of the data to the conceptual model), they can be accepted as offering a sufficient basis for further analyses of the coherence and adequacy of the affective component of the test. Together, the three revised affective constructs with their


respective subconstructs comprise 64 of the 89 items in the two questionnaires, covering 31 of the current 35 scales presented in Table 16. Basic descriptives for the revised constructs A, B and C are presented in Table 26 (ANOVA; all marked differences are statistically significant at level p<.001).
Table 26 Basic descriptives for the scales for the revised affective L2L domains

                    α      BOYS Mean   SD      GIRLS Mean   SD      F
A1 SUPPORTIVE      .88     3.54        0.60    3.63         0.54    33
A2 DETRIMENTAL     .72     2.71        0.48    2.70         0.49    ns
B1 SELF-CONCEPT    .74     3.85        0.56    3.80         0.58    12
B2 SELF-ESTEEM     .64     4.01        0.86    3.83         0.91    46
C1 SCHOOL          .77     3.60        0.65    3.74         0.61    62
C2 FAMILY          .66     3.93        0.72    4.02         0.72    17
C3 FRIENDS         .61     3.23        0.75    3.42         0.71    83
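The gender comparisons in Table 26 are one-way ANOVAs over the construct mean scores. A minimal sketch of the computation follows, with synthetic data standing in for the real student file (all variable names are hypothetical):

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "gender": rng.choice(["boy", "girl"], size=n),
    "a1_supportive": rng.normal(3.6, 0.6, size=n),  # 1-5 scale mean scores
    "c1_school": rng.normal(3.7, 0.6, size=n),
})

for score in ["a1_supportive", "c1_school"]:
    boys = df.loc[df["gender"] == "boy", score]
    girls = df.loc[df["gender"] == "girl", score]
    f, p = stats.f_oneway(boys, girls)  # with two groups, F equals t squared
    print(score, round(f, 1), round(p, 3))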

In light of earlier research, the lack of a gender difference in the construct for detrimental attitudes is somewhat surprising. However, it is explained by the compound nature of the scale, with the means for boys and girls in the individual scales annulling each other's effects. Within the construct, the gender difference is most notable in mild helplessness and avoidance orientation (both F=111), the former being more typical for girls and the latter for boys 29. Furthermore, while avoidance orientation seems to be mildly but equally detrimental for both boys and girls (correlation with GPA -.10), mild helplessness, when occurring, is more detrimental for boys than girls (correlation with GPA -.20 vs. -.12), probably due to its stronger interconnectedness with other detrimental affective attitudes in boys. This difference is also reflected in the lower correlation between the whole construct A2 and GPA for girls than for boys (-.28 vs. -.35).

29 The difference is also statistically significant but clearly smaller in means-ends: ability and means-ends: chance (F=31 and F=12, respectively), both more typical for boys than girls.

Between-country differences are biggest in students' views concerning school (C1) and in their learning-supportive attitudes (A1) (F=78 and F=68, respectively). Especially in the latter, differences between countries are clearly more prominent than differences between boys and girls. There are also between-country differences in students' views concerning their parents' attitude toward school and the support they experience getting from home, while the role of friends differentiates more strongly along gender lines. The difference, however, is much more prominent in the ELLI scale learning relationships: friends than in the FILLS scale perceived support: peers (F=90 vs. F=22), reflecting the differing nature of the two scales. Differences related to students' home background are most prominent in constructs B1 and A2, with the impact of the mother's education being a little stronger than that of the father's (F=64/39 vs. F=50/24). Of the seven revised (sub)construct scales, the positive and negative dimensions of construct A and the school-related dimension of construct B are clearly related to students' school attainment (GPA) and, to a lesser degree, to their performance in the cognitive component of the learning to learn test (Table 27) 30.
Table 27 Correlations between the affective domains, test performance and GPA

                   A2      B1      B2      C1      C2      C3      TEST    GPA
A1 SUPPORTIVE     -0.26    0.44    0.25    0.62    0.41    0.48    0.11    0.22
A2 DETRIMENTAL            -0.40   -0.15   -0.25   -0.14   -0.09   -0.28   -0.32
B1 SELF-CONCEPT                    0.32    0.34    0.28    0.22    0.29    0.42
B2 SELF-ESTEEM                             0.22    0.24    0.15   -0.02    0.03
C1 SCHOOL                                          0.42    0.51    0.13    0.14
C2 FAMILY                                                  0.32    0.06    0.08
C3 FRIENDS                                                        -0.01    0.00
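Tables 27 and 28 are plain Pearson correlation matrices over the construct mean scores; a minimal sketch follows, with hypothetical column names and random data for illustration only:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
cols = ["a1", "a2", "b1", "b2", "c1", "c2", "c3", "test", "gpa"]
scores = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
print(scores.corr().round(2))  # pairwise Pearson correlations, as in Table 27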

Compared to the correlations between the individual scales and GPA or test performance, the added value of the constructs is not big 30, endorsing the view that the affective component of the instrument could well be foreshortened, using the new revised constructs as a guideline. Also, even if the items in the three original constructs showed very low item-total correlations, those between the revised subconstructs and their aggregated main constructs A, B and C are high (Table 28).

30 All correlations above r=.06 are statistically significant (p<.001), but even the r=.22 correlation between construct A1 and GPA explains only 5 % of the variance in the latter.
Table 28 Correlations of the affective subconstructs to the three main affective constructs Learning motivation, learning strategies and orientation toward change (A), Academic self-concept and self-esteem (B), and Learning environment (C)

                    A       B       C
A1 SUPPORTIVE      0.83    0.39    0.64
A2 DETRIMENTAL    -0.76   -0.30   -0.20
B1 SELF-CONCEPT    0.53    0.71    0.36
B2 SELF-ESTEEM     0.26    0.89    0.26
C1 SCHOOL          0.57    0.33    0.80
C2 FAMILY          0.36    0.31    0.75
C3 FRIENDS         0.38    0.22    0.79

The relatively high correlations are reflected in the structural equation model (Figure 10) constructed on the seven subconstructs (Chi square = 2.261, degrees of freedom = 5, probability level = .812).

[Path diagram omitted: the seven subconstructs A1, A2, B1, B2, C1, C2 and C3, each with an error term, loading on a single latent factor LtoL.]

Figure 10 The affective component of L2L


To assure a theoretically justified and, from the schools' perspective, more fruitful further development and revision of the test, it is recommended to begin with the subconstructs, which offer more detailed and practically applicable information on the relationship between students' attitudes and their actual comportment in their present learning environment, the school. The information lost or confounded in the aggregated data, especially for constructs B and C, can be seen when comparing the correlations between the different affective constructs and students' test attainment and GPA (Table 29) to those presented earlier (Table 27).
Table 29 Correlations of the affective constructs A, B and C to students' test attainment and GPA

        B       C       TEST    GPA
A      0.44    0.55     0.23    0.33
B              0.36     0.12    0.22
C                       0.07    0.09

3.2.3 Item-level analysis of the affective component

An IRT model was fitted to each of the three affective constructs separately, to respect the relative independence of the constructs 31. Of these, only the data for construct A fit the model to an even tentatively satisfactory degree (Figure 11). What the model reveals, however, is the differing nature of the supportive ELLI and FILLS items, with those of ELLI falling slightly amidst the least detrimental of the items of the negative dimension A2. The same seems to be true regarding some of the negative items, in particular the two FILLS items measuring performance-avoidance orientation (not wanting to appear incompetent) and one of the milder items in the ELLI fragility and dependence scale.

31 Initially, a preliminary IRT analysis was executed over the whole affective domain, but the composite of all the affective items did not succeed in forming a coherent base for the model.
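The item maps below were produced with ConQuest; the report's polytomous model cannot be reproduced here, but as a rough illustration of the Rasch logic underlying such maps, the following is a minimal joint-maximum-likelihood sketch for dichotomised responses. This is simulated data and a simplified estimator, not the project's estimation procedure.

import numpy as np

def rasch_jmle(X: np.ndarray, n_iter: int = 50):
    # Joint maximum likelihood for the dichotomous Rasch model.
    # X: persons x items matrix of 0/1 responses.
    theta = np.zeros(X.shape[0])          # person abilities
    b = np.zeros(X.shape[1])              # item difficulties
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        theta += (X - p).sum(1) / (p * (1 - p)).sum(1)  # Newton step for persons
        theta = np.clip(theta, -6, 6)     # crude guard against perfect scores
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        b -= (X - p).sum(0) / (p * (1 - p)).sum(0)      # Newton step for items
        b -= b.mean()                     # identification: centre difficulties at 0
    return theta, b

# simulated check: recover item difficulties spread between -2 and 2 logits
rng = np.random.default_rng(3)
true_theta = rng.normal(0, 1, 400)
true_b = np.linspace(-2, 2, 20)
P = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
X = (rng.random(P.shape) < P).astype(float)
theta_hat, b_hat = rasch_jmle(X)
print(np.round(b_hat[:5], 2))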

[Item-person map omitted. One X stands for 7 students; items are colour-coded A1 Supportive / A2 Detrimental. For the list of the individual item numbers in the constructs, see Appendix 2.]

Figure 11 Item and student distribution for the affective construct A Learning motivation, learning strategies and orientation toward change (IRT, ConQuest)


The fit is clearly weaker for constructs B and C, with the norm-level (0) based on the items dividing the student body even more unevenly than was the case for construct A (Figure 12 and Figure 13), rendering the results of the IRT difficult to interpret.
[Item-person map omitted. One X stands for 7 students; items are colour-coded B1 Academic self-concept / B2 Self-esteem. For the list of the individual item numbers in the constructs, see Appendix 2.]

Figure 12 Item and student distribution for the affective construct B Academic self-concept and Self-esteem (IRT, ConQuest)


[Item-person map omitted. One X stands for 6.5 students; items are colour-coded C1 School / C2 Family / C3 Friends-peers. For the list of the individual item numbers in the constructs, see Appendix 2.]

Figure 13 Item and student distribution for the affective construct C Learning environment (IRT, ConQuest)

3.2.4 Conclusions regarding the affective component

Probably due to weaknesses in the internal coherence of the subscales and subconstructs in the affective constructs B and C, the models fit the data poorly, resulting in the many low item-total correlations in the reliability analyses, and in the IRT models offering little to lean on in improving the affective component of the learning to learn test. Accordingly, decisions concerning the further development of the indicator should rather be based on theoretical arguments, combined with analyses of the correlations between the affective items/scales and students' attainment in the test and at school. What can be recommended in all cases, however, is harmonising all the scales in the instrument to three or even four items each. The latter, applied in the next phase, would allow discarding the least fit items once more, leading to all the scales comprising three items in their final form, to save both testing time and students' patience, which seems to be put to the test by questions whose relatedness they see but do not understand the (statistical) reason for. For the ELLI scales, it is recommended to apply a more rigorous demand of unidimensionality, if only to be able to better interpret what they portend and, in the future, to be able to offer schools advice in fostering the traits they measure as part of the advancement of students' learning to learn competences. As could be seen in the above analyses, some of the affective dimensions measured in the test appear to be more relevant from the point of view of predicting students' comportment in the learning challenges relevant to their lives at the moment. However, as the objective of a learning to learn indicator is to predict the comportment of young people not now, in the school they are currently attending, but later in life, it is not clear to what extent decisions concerning the factors to be measured should be affected by the explanatory power of the test vis-à-vis their current GPA. This being the case, it can be concluded that the three constructs A, B and C (Learning motivation, Self-concept and Learning environment, for short) form an acceptable basis on which to continue the revising and foreshortening of the affective component of the instrument. Based on the reliability analyses of the original and semi-original scales, the further analyses of the three main and seven subconstructs, and the need to reduce the number of scales, compounded by the need to increase the number of items in the scales, the following recommendations are made:


1. The construct A Learning motivation, learning strategies and orientation toward change presents a relatively simple problem in terms of revising the current indicator, with decisions mainly concerning the number of scales that can be included and their internal coherence. Based on the results presented above, two separate subconstructs are recommended to be formed, representing the positive and the negative dimension, both affecting students' readiness to accept and undertake new challenges in learning and (cognitive) development. From amongst the scales present in the current instrument, the following are recommended to be maintained in the instrument for the pilot phase, each revised and enlarged to comprise four items (Table 30).
Table 30 Scales to be retained in the affective construct A Learning motivation, learning strategies and orientation toward change

A1 LEARNING SUPPORTING DIMENSION
Agency belief: effort
Goal orientation mastery-extrinsic
Control motivation
Strategic awareness
Meaning making - assimilation
Goal orientation mastery-intrinsic
Critical curiosity
Deep processing
Creativity - readiness to try new things

A2 LEARNING IMPEDING DIMENSION
Fragility and dependence / helplessness
Academic withdrawal
Means-ends-beliefs / chance
Goal orientation - avoidance
Means-ends-beliefs / ability

However, some of the scales, especially meaning making, strategic awareness, critical curiosity, creativity, and fragility and dependence, have been included in the construct in the hope that a contentually more consistent scale could be construed to better reflect their respective scale names. Considering the low reliability of many of the scales, they would all benefit from further item adjustment based on a theoretical analysis of their respective content domains, while increasing the number of items to four in each scale for the pilot.

2. The construct B Self-concept and self-esteem presents two fundamental questions concerning its role in the test. The first concerns the strongly reflective nature of (academic) self-concept, as compared to the future-oriented nature of the concept learning to learn: to what degree can the former be of use in measuring one's readiness to meet new learning challenges, regardless of one's present level of proficiency? The second is the possible desirability of measuring somewhat more widely students' subjective well-being as the factor backing one's readiness to venture toward new challenges. Or should some of the contentually less rigorous scales from ELLI, listed above as part of construct A1 or construct C, be seen as an opening in this direction? The construct is recommended to be built on the basis of the two dimensions of academic self-concept and subjective well-being; the former comprising self-concept scales for general ability and the two areas of mathematics and reading, the latter represented at this point by just the one scale on self-esteem (Table 31).
Table 31 Scales to be retained in the affective construct B Academic self-concept and self-esteem

B1 ACADEMIC SELF-CONCEPT
Academic self-concept / thinking
Academic self-concept / mathematics
Academic self-concept / reader

B2 SUBJECTIVE WELL-BEING
Self-esteem

3. In the third affective construct C Learning environment, there seems to be a discrepancy between the actual and the (presumably) intended coverage of the construct. This is especially salient in the domain school, which at present comprises two kinds of scales. Despite its name, the scale learning environment: school clearly measures students' personal views on the importance and interest of things learnt at school, and the scale group work: task orientation inquires into their work habits in class. Hence, if the construct C is meant to be a contextual factor for measuring the support or constraint that outside factors impose on the formation of learning to learn, in their present form both would better fit the construct A1 for (personal) learning motivation and strategies. Also, as group work seems not to be equally common in all countries, a new set of items to measure students' task orientation might be needed. Reflecting the above, a restructuring of the construct C1 might be warranted, using the current scale learning environment: teachers as its basis, with the addition of one or two dimensions related more closely to what the school does (or does not do) to support the formation of learning to learn (Table 32). For example, is the emphasis in teaching on passing tests and not failing exams, or on raising students' interest and opening new vistas for new questions? Are students also encouraged toward independent work, alone or in groups, alongside more teacher-dominated instruction? These too are dimensions touched on by some of the ELLI scales, but with sets of items that unfortunately do not form very reliable scales. The same hidden bi-dimensionality is true of the other two subconstructs, family and friends, as well. But, as in relation to school, as the contexts themselves are important, the two scales forming each of the current constructs are presented in the recommendation of Table 32, despite the changes the scales might warrant to better measure the support the different contexts offer to the development of students' learning to learn competence.


Table 32 Scales to be retained in the affective construct C Learning environment

C1 SCHOOL
Learning environment / school
Learning environment / teachers

C2 FAMILY
Perceived support / parents
Learning relationships / family

C3 FRIENDS
Perceived support / peers
Learning relationships / friends

As in the cognitive domain, the recommendations presented above should be understood as only a rough outline for the contextual part of the affective domain in the next phase, due to the many problems related to both the reliabilities of the current scales and the operationalisation of the respective constructs. However, scales with a more rigorous theoretical backing and a better fit with the conceptual framework can well be built using the experience of the pre-pilot as a guideline.

3.3 The metacognitive domain


In the test, metacognition was measured with four different elements, most importantly with two questions posed immediately after four of the cognitive tasks (LagSev, Text, Meta and Lakes), forming two separate scales. The first question, "I believe I did well in this task", was set to measure students' ability to accurately evaluate their own performance (metacognitive accuracy). The other, "I am confident of my answers", was set to measure their confidence in the accuracy of that evaluation. The metacognitive component also comprised a five-item cognitive task, referred to earlier as the Meta task and analysed together with the other cognitive tasks in subchapter 3.1. The task was taken from the same original Spanish test on metacognition as the questions on accuracy and confidence, where its role is to ground the scales for the latter in students' actual performance.


Fourthly, there was a yes-no question concerning students' confidence in their own reasoning after each item in the Lakes task. First, the results of each of the different subscales will be looked at separately. After that, a synthesis will be made of the functioning of the metacognitive component as a whole. Finally, some propositions will be made toward the further development of the metacognitive component for the next phase of the indicator project.

3.3.1 The metacognitive monitoring task

Even if one of the five items in the Meta task presented a novel situation for a math problem in not offering enough information to be solved (one of the four answer options in all the Meta items), the task does not offer extra information regarding students' metacognitive monitoring compared to the other tasks to which the accuracy and confidence questions were attached. This might be due to all the items being fairly easy verbal arithmetic problems (mean percentage of correct answers 62 %, SD 27 %), with even the one different item easily detected and solved. Two of the items fall just below the norm level 0 of the IRT scale while the other three fall just above or below the -1 level. Besides, the task does not form a very reliable scale (Cronbach's alpha .47; item-total correlations between .15 and .32). However, as the Meta task is the most school-like of the cognitive tasks in the test, the two metacognitive questions set after it offer a possibility to set a baseline for the accuracy of students' assessment of their own competence or success, reflecting the feedback from school arithmetic classes.

3.3.2 Metacognitive accuracy and confidence

Students' responses to the two questions forming the scales for metacognitive accuracy and confidence will be analysed at two levels. First, the results will be analysed using the actual answers to the two questions with their five-point Likert scales. This allows looking into the intra- and inter-relations between students' general and task-specific success-related beliefs and confidence, their actual attainment in the tasks, and their general ability-related beliefs, manifested in their answers to the respective questions in the two affective questionnaires. For some analyses, a new variable has been created on the basis of the relationship between students' belief in doing well in a task and their actual performance in it. After that, the results will be analysed using a dichotomous variable based on the accuracy of students' evaluations (for the calculation of this construct, see the Manual for calculating the cognitive, metacognitive and affective point sums). For the most part, students seem to have slightly overestimated their competence in the cognitive tasks 32. A closer look reveals, however, that this is true only in the tasks LagSev and Text, while students have in fact underestimated their success by almost the same margin in the Lakes task. Instead, as could have been predicted, their evaluation of their performance in the school-like Meta task coincided almost exactly with their actual success in the task (Table 33).
Table 33 Students' beliefs concerning their success in the different tasks and their attainment in the tasks (the 1-5 Likert scale has been converted to percentages to be comparable to students' attainment in the tasks)

          I believe I did well         Task attainment
          N       Mean     SD          Mean     SD
LAGSEV    4532    65 %     31.4        47 %     30.7
TEXT      4498    55 %     28.3        42 %     17.5
META      4602    60 %     29.7        62 %     27.0
LAKES     3820    40 %     33.3        48 %     20.8

Girls were more modest than boys in their beliefs about their success in the different tasks, also in the LagSev task where their performance was significantly better. However, the gender difference was most prominent in the Meta task, reflecting but exceeding the gender difference in students' academic self-concept concerning their mathematical competence (F=79 vs. F=57).

32 To compare the two, the original Likert scale of the question has been transformed to percentages to match the percentage of correct answers (1=0 %, 2=25 %, 3=50 %, 4=75 % and 5=100 %).
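The conversion described in footnote 32, and the resulting over/underestimation measure, amount to the following; a minimal sketch with hypothetical column names and made-up example values:

import pandas as pd

def likert_to_pct(answer: pd.Series) -> pd.Series:
    # map the 1-5 Likert answer to 0/25/50/75/100 %, as in footnote 32
    return (answer - 1) * 25.0

df = pd.DataFrame({
    "lagsev_did_well": [5, 3, 2],          # "I believe I did well in this task"
    "lagsev_score_pct": [60.0, 55.0, 40.0],
})
df["belief_pct"] = likert_to_pct(df["lagsev_did_well"])
df["overestimation"] = df["belief_pct"] - df["lagsev_score_pct"]
print(df)  # positive overestimation = belief above actual performance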

There were also between-country differences both in students' evaluations of their success in the tasks and in the relationship between these evaluations and students' actual performance, i.e., in their metacognitive accuracy (Table 34). In most countries, students overestimated their success especially in the LagSev task and underestimated it in the Lakes task, even if there were differences in the level of this overestimation (F=20-32, p<.001). An exception to this was the Text task, where the difference over the eight countries in students' assessments did not reach statistical significance. Overall, the most modest, even if not always least incorrect, estimates were those of the Finnish students.
Table 34 Students' beliefs concerning their success in the different tasks and their attainment in the tasks by country (1-5 Likert converted to %)

           I believe I did well             Task performance
           LAG    TEXT   META   LAKE        LAG    TEXT   META   LAKE
Austria    66 %   49 %   59 %   41 %        44 %   37 %   61 %   45 %
Cyprus     63 %   58 %   67 %   34 %        44 %   42 %   60 %   43 %
Finland    59 %   51 %   56 %   37 %        47 %   49 %   70 %   53 %
France     74 %   49 %   55 %   36 %        53 %   42 %   58 %   49 %
Italy      69 %   57 %   64 %   51 %        52 %   40 %   64 %   51 %
Portugal   57 %   62 %   59 %   41 %        46 %   41 %   59 %   46 %
Slovenia   64 %   55 %   59 %   34 %        39 %   37 %   58 %   43 %
Spain      68 %   56 %   64 %   45 %        55 %   50 %   64 %   52 %

Correlations between students' evaluations of their success in the four tasks and their actual attainment in them reveal an interesting bi-dimensional quality of these assessments (Table 35). On the one hand, students' beliefs concerning their success seem to form a scale of their own across all the tasks (Cronbach's alpha .61), fairly independent of their success in solving them. On the other hand, each task-specific assessment seems to tap, at least to some extent, students' attainment in just that task. The only exception to this rule is the Text task, which might have misled many students' self-evaluation, as, the task being a novel type of reading comprehension test to most of them, they might not have had any earlier reference for their evaluation of competence.


Table 35 Correlations between students' beliefs of doing well in the different tasks and their actual attainment

                 "Did well" in task:        Attainment in task:
                 LAG    TEXT   META         LAGSEV  TEXT   META   LAKES
LAG did well                                0.27    0.05   0.16   0.09
TEXT did well    0.23                       0.03    0.05   0.06   0.06
META did well    0.29   0.31                0.10    0.02   0.33   0.08
LAKE did well    0.24   0.32   0.35         0.13    0.04   0.17   0.26

Even if the questions measuring students' beliefs in their success in the different tasks seem to form a scale of their own, this scale, and the individual task-specific beliefs forming it, have fairly low correlations with the more general self-ability belief measures in the test (Table 36). Accordingly, the correlations between the self-evaluations of doing well and the adjacent tasks are not greatly affected by taking into account students' general ability beliefs, meaning that the correlations between students' evaluations and their performance might indicate real task-specific reflection.
Table 36 Correlations between students' beliefs of doing well in the different tasks and some related affective scales 33

                 Agency:   Self-concept  Self-concept  Self-concept  Control      TEST   GPA
                 ability   thinking      math          reading       expectancy
LAG did well     .23       .21           .21           .14           .16          .20    .23
TEXT did well    .18       .21           .11           .19           .11          .08    .08
META did well    .25       .25           .35           .08           .14          .19    .20
LAKE did well    .16       .20           .21           .15           .13          .19    .23

33 Due to the large number of students in the weighted data, even the weakest correlations of r=.08 between some of the variables are statistically significant at level p<.001.

Not only is students' belief in doing well related to their success in the task, but the accuracy of their evaluation is also related to their relative success in the task. While students belonging to the weakest performing quartile clearly overestimated their performance in most of the tasks (the difference between belief and performance was 34 %, 23 % and 10 %, respectively, in the LagSev, Text and Meta tasks), students in the best performing quartile erred only in the Meta and Lakes tasks, and by assessing their success too negatively (differences -12 % and -14 %, respectively) 34. The result is of special significance as the students in the best performing quartile seem to be especially acute in assessing their performance in the two tasks (LagSev and Text) that measure thinking skills in a way that is fairly far removed from school-type tasks in those two domains, mathematical thinking and text comprehension. In these tasks, unlike in the Meta task, students can be considered to be more on their own in their evaluations, lacking easily applicable previous feedback to base their evaluations on. All in all, the questions regarding students' belief in their success seem to tap an ability to assess one's own performance in a way that could be measured as one facet of metacognitive competence. The question in the test concerning metacognitive confidence, attached to the same tasks as the questions concerning students' belief in their performance, seems to function in much the same way as the former. It forms a fairly coherent scale over the four tasks (Cronbach's alpha .61), and each task-specific assessment seems to tap, at least to some extent, students' attainment in just that task, though again with the exception of the Text task (Table 37). However, the reason for the similarity between the results of the two questions might be the simple fact that the wording of the item on confidence offers little possibility for contradicting the first: what would it mean to believe that one has not done well and still be very confident of one's answers?

34 Differences between the quartiles in the accuracy of their beliefs were very clear (ANOVA; F=204, 82, 95 and 45 for LagSev, Text, Meta and Lakes, respectively).

Table 37 Correlations between students' confidence in their answers in the different tasks and their actual attainment

                   Confidence in task:        Attainment in task:
                   LAG    TEXT   META         LAGSEV  TEXT   META   LAKES
LAG confidence                                0.22    0.02   0.12   0.07
TEXT confidence    0.25                       0.02    0.02   0.02   0.03
META confidence    0.33   0.40                0.10    0.02   0.27   0.08
LAKE confidence    0.26   0.35   0.39         0.09    0.01   0.14   0.23

Still, the question of confidence, maybe because presented, and hence tied to the respective task, only after the first question of one's belief in having done well, seems to be one small step further removed from the general ability-related beliefs of the students (Table 38). The weaker connection between confidence and GPA reflects the bigger gender difference in students' stated confidence than in their belief in having done well in the tasks.
Table 38 Correlations between students confidence in having done well in the different tasks and some other affective scales 35
Agency: ability LAG confidence TEXT confidence META confidence LAKE confidence .19 .20 .25 .15 Selfconcept thinking .18 .23 .26 .19 Selfconcept math .20 .10 .31 .22 Selfconcept reading .13 .14 .08 .13 Control expecta ncy .12 .10 .14 .10 TEST GPA

.17 .04 .18 .17

.18 .07 .16 .14

Like students' belief in their success, their confidence in this success was also related to their actual performance in the tasks, even if the connection was a little weaker. Most probably the difference reflects the tendency of the weaker students to overestimate their performance, whereas the better students' estimates were either accurate or even underestimated their success.

35 Due to the large number of students in the weighted data, even the weakest correlations of r=.08 between some of the variables are statistically significant at level p<.001.

Put together, the scales of accuracy and confidence reach a reliability of α=.82. Due to the interconnectedness of the two questions, however, the combined scale does not allow for a coherent interpretation, given the asymmetry of the scale in regard to students who did or did not believe they had performed well.

3.3.3 An index for metacognitive accuracy and confidence

In addition to looking at students' actual answers to the two metacognitive questions, new dichotomous variables were built to gauge the correctness of students' evaluations in light of their performance. This was done by dividing students into two groups depending on the degree of accordance between their belief in doing well and their actual performance, transforming the Likert scale into percentages and relating these to the percentage of correct answers in each of the tasks. The cut-point for sufficient accuracy for both metacognitive evaluation and confidence was set at 25 % at both ends of the scale, i.e., for having performed well and knowing it (+ +) and for having performed poorly and knowing having done so (- -). This led to 50 % to 59 % of students being classified as metacognitively accurate in the different tasks. The share was biggest in the Meta task and lowest in the LagSev task. Regarding metacognitive confidence, the percentages were 51 % to 58 %, with the share of students confident of their success biggest in the Text task. In accordance with the analyses based on the actual values of the Likert scale, presented above, the percentage of students in the metacognitively accurate group increases with the level of performance and, as in the previous analyses, the difference is most prominent in the tasks LagSev and Text and non-existent in the task Lakes. Concordantly, even if the separate accuracies of students' evaluations of their success in the different tasks do not form a coherent scale (Cronbach's alpha .24), the number of accurate evaluations students make seems to be a good indicator for finding the students who also perform best in the cognitive learning to learn tasks, especially among boys (Table 39).
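The exact classification rule is documented in the project's calculation manual; the sketch below encodes only one possible reading of the 25 % cut-points, in which a student counts as accurate when belief and performance fall at the same end of the percentage scale, and then sums the accurate evaluations over the four tasks as in Table 39. All names and example values are hypothetical.

def accurate(belief_pct: float, score_pct: float, cut: float = 25.0) -> bool:
    # (+ +): performed well and believed so; (- -): performed poorly and knew it
    high = belief_pct >= 100 - cut and score_pct >= 100 - cut
    low = belief_pct <= cut and score_pct <= cut
    return high or low

# count of accurate evaluations over the four tasks for one student
beliefs = {"lagsev": 75, "text": 25, "meta": 50, "lakes": 0}
scores = {"lagsev": 80, "text": 20, "meta": 62, "lakes": 48}
n_accurate = sum(accurate(beliefs[t], scores[t]) for t in beliefs)
print(n_accurate)  # 2: accurate on LagSev (+ +) and Text (- -)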


Table 39 Test competence 36 of students according to metacognitive accuracy of evaluation of own performance (number of correct evaluations in four tasks)

Number of accurate     BOYS                       GIRLS
task evaluations       N      TEST     SD         N      TEST     SD
0                      199    -0.48    0.60       235    -0.23    0.62
1                      575    -0.17    0.64       546    -0.11    0.64
2                      690    -0.03    0.63       807    -0.02    0.59
3                      573     0.15    0.60       647     0.19    0.59
4                      193     0.44    0.70       220     0.34    0.64
Total                  2229   -0.02    0.67       2455    0.02    0.64

36 Test competence is indicated by a standardized value, as differences in the mean percentage of correct answers are related to the mean accuracy of students' evaluations of their performance.

3.3.4 Conclusions regarding metacognition

In the learning to learn framework, metacognition has been operationalised as students' ability to reflect on and evaluate their own performance. The results reported above might be seen to support the interpretation of the measured phenomenon as accuracy of metacognitive evaluation. The conclusion has to be regarded as only contingent, however, in the lack of any outside reference other than the demonstrated test competence on which the measurement is built. Still, both in light of these results and of the conceptual framework behind the pre-pilot, it is obvious that metacognition should be seen as an essential component of learning to learn and hence has its place in the instrument for its measuring. The two questions used for the measuring of metacognition, however, are conditional on each other in a way that prevents their full use as two independent dimensions of students' task- and/or self-related reflective evaluation. Also, there seems to be no reason why questions aiming to measure students' metacognitive competence could not be based only on cognitive tasks that directly serve the measuring of the cognitive component of learning to learn. This could further free the metacognitive component


from students' academic self-concept(s), helping to differentiate between students' ability to make general and context-bound evaluations of their performance. Accordingly, further work is still needed to reinforce the metacognitive component in the instrument to form an independent third dimension alongside the cognitive and the affective ones. It might be warranted to return to the theoretical literature concerning metacognition, and to discuss once more which dimensions or sub-areas of metacognition can reliably be operationalised for large-scale paper-and-pencil type assessment. The crux of the problem seems to be how both metacognition in action, something that students' responses to the different questions can be taken as an indicator for, and their beliefs concerning their metacognitive processing, the feature that the current metacognitive component rather seems to measure, could be better integrated in the instrument. One way to enrich the metacognitive component would be to bring some of the questions now set at the end of the booklet to the end of each cognitive task (interest, difficulty, novelty, etc.). In this way, students would be encouraged to reflect on the tasks and on their work, and this (even if extrinsically) enforced moment of reflection might also serve the students in moving from one task to another and help them look for the novel qualities of each in a more concentrated manner.

3.4 The three-dimension framework of learning to learn


To estimate the viability of operationalising learning to learn through the three dimensions (cognitive, affective and metacognitive), each comprising, respectively, five, seven, and two subconstructs, both regression models and structural equation modelling have been used. Lacking other indicators, students' school achievement, operationalised as country-specific GPA quartiles to allow for differences in marking, has been used to stand for the outcome of learning to learn.


Despite the problems in the cognitive, affective and metacognitive constructs reported above, and the recommendations made for their revision, the modelling has been done on the basis of the original constructs, to further validate those analyses. A list of the constructs used in the different models is presented in Table 40.
Table 40 The cognitive, affective and metacognitive dimensions with their respective subconstructs

THE COGNITIVE DIMENSION
A  Identifying a proposition
B  Using rules
C  Testing rules and propositions
D  Using mental tools
E  Applying reasoning in everyday problems

THE AFFECTIVE DIMENSION
A1 Learning motivation: supportive attitudes
A2 Learning motivation: detrimental attitudes
B1 Academic self-concept
B2 Self-esteem
C1 Learning environment: school
C2 Learning environment: family
C3 Learning environment: friends

THE METACOGNITIVE DIMENSION 37
A  Metacognitive accuracy
B  Metacognitive confidence

Based on the above reported lacking uni-dimensionality of most of the constructs, both regression modelling and SEM were used. A preliminary regression model, based on the three main constructs, is presented in Figure 14.

37 The metacognitive monitoring task was left out of the model as it presented no special connection to the questions measuring metacognitive accuracy and confidence compared to the other cognitive tasks.

[Path diagram omitted: the affective, cognitive (competence) and metacognitive dimensions predicting GPA quartile, with standardised weights of .20 (affective), .45 (competence) and .00 (metacognitive), and a squared multiple correlation of .26 for GPA.]

Figure 14 The cognitive, affective and metacognitive dimensions in explaining GPA, regression model 1
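Regression model 1 corresponds to an ordinary least-squares regression of the GPA quartile on standardised domain scores. A minimal sketch with statsmodels follows, using synthetic data whose coefficients roughly mimic the reported weights (all variable names are hypothetical):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame({
    "competence": rng.normal(size=n),      # cognitive domain score
    "affective": rng.normal(size=n),
    "metacognition": rng.normal(size=n),
})
# synthetic outcome mimicking the reported pattern (betas .45 / .20 / .00)
df["gpa_quartile"] = (2.5 + 0.45 * df["competence"] + 0.20 * df["affective"]
                      + rng.normal(scale=0.83, size=n))

X = sm.add_constant(df[["competence", "affective", "metacognition"]])
model = sm.OLS(df["gpa_quartile"], X).fit()
print(round(model.rsquared, 2))  # the report's model explained 26 % of GPA variance
print(model.params.round(2))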

The three dimensions together explain 26 % of the variation in GPA. The cognitive component (competence) is twice as powerful as the affective in explaining the variation, but the latter clearly provides an independent contribution of its own. The metacognitive dimension, instead, does not provide any independent explanatory power, as the relation to achievement it was shown to have in subchapter 3.3 is already embedded in the impact of the other two dimensions in the model, due to collinearity. A second regression model was constructed leaving out the metacognitive component but entering the cognitive and the affective components into the model through their respective subconstructs. Of the cognitive subdimensions, all but D (the Lakes task) have a unique impact in explaining GPA. In the affective domain, the two subdimensions C1 learning environment: school and C2 learning environment: family don't have an independent contribution in explaining GPA, which, in the case of C1, might reflect the overlap of some of its items with the construct A1, discussed in 3.2.4. The third dimension of the learning environment construct, C3 (peers), which was shown to have a zero correlation with GPA in Table 27, turns out to be negatively correlated with GPA, indicating that the positive aspect of the construct is already seized by the other dimensions with a positive correlation with GPA, leaving only the counter- or outside-of-school aspect of the impact of peers to have an independent role 38. The same is true regarding self-esteem (B2), a problem discussed already earlier and recommended to be solved by using a broader indicator for subjective well-being. When the model is built to include the cognitive and the affective subconstructs (the metacognitive domain was left out based on model 1), the best predictor for GPA is the affective construct B1, confirming the earlier presented view that students' positive view of their abilities is strongly related to, and most probably fairly directly reflects, their accomplishments at school, contributing little to understanding the reasons that lead to that accomplishment. Despite the lacking uni-dimensionality of the different constructs (a problem from the point of view of the prerequisites of SEM), a structural equation model (AMOS) was made on the basis of the subconstructs of the three main domains (Figure 15). Contrary to the framework, the metacognitive component does not form a dimension of its own but forms part of the cognitive domain. The high covariance between the two metacognitive constructs and their relative independence of the rest of the cognitive component reflect the technical fact that they are built to classify as metacognitively competent both those students who correctly judge themselves as having performed well in the tasks (+ +) and those who are correct about having performed poorly (- -). In the model, the affective dimension is best measured by construct A, with the contribution of B and C less than half of that. The cognitive domain shows a more balanced structure concerning the five subconstructs, with the constructs B and E (the LagSev and CCST tasks) providing the best and the construct D (the Lakes task) the weakest contribution in the cognitive domain. The negative correlation between B and E indicates a systematic difference between students doing well in the more rigorous mathematical LagSev task and the more applied and age-referenced CCST task.

38 A preliminary correction might be attained by rewording the ELLI items more rigorously regarding what kind of cooperation with friends is meant. At present, one student may have referenced his/her answer to discussing problems in personal life while another has thought of tomorrow's exam.

[Structural equation diagram omitted: latent factors Aff (measured by the affective constructs A, B and C) and Cog (measured by the cognitive constructs A-E), together with the two metacognitive constructs, predicting GPA quartile.]

Figure 15 The cognitive, affective and metacognitive subconstructs explaining GPA, structural equation model 1 (AMOS)

A further AMOS model 2 was constructed to study the interconnectedness of the three main domains, further confirming the non-unidimensionality of the domains, which comprise even components with negative correlations with test attainment and GPA (Figure 16).


[Structural equation diagram omitted: as in Figure 15, with the covariances between the affective, metacognitive and cognitive components added.]

Figure 16 The cognitive, affective and metacognitive subconstructs explaining GPA, structural equation model 2 (AMOS)

All the models presented above can be seen to support the conclusions presented earlier in the respective subchapters regarding the desirability of revising the test using the compiled pre-pilot data as a point of departure. These recommendations will be summed up in Chapter 4, after a short overview of students' views on the test in the next subchapter.


3.5 Students' view on the test


Students' opinion of the test was in most respects slightly more positive than negative. Booklet 1, which was shorter with its three cognitive tasks compared to the four of Booklet 2, was received a little more favourably (Table 41). The comparison is complicated, however, by the fact that two thirds of the students were given Booklet 1 first, while only a third, and in just six countries, began with the longer Booklet 2.
Table 41 Students' responses to the question "How did you find the test booklet?"

                                                                          Booklet 1       Booklet 2
                                                                          Mean    SD      Mean    SD
1 I found the test in this booklet easy to complete                       3.36    1.08    3.17    1.08
2 I found it easy to concentrate through this test booklet                3.59    1.16    3.46    1.17
3 I found the language in this test booklet easy to understand            3.69    1.12    3.69    1.09
4 I tried my hardest in this test booklet                                 3.95    1.07    3.81    1.13
5 I was motivated to complete this test booklet                           3.39    1.20    3.30    1.23
6 Learning in school really helped me answer this test booklet            2.88    1.21    2.95    1.23
7 What I have learned outside of school was helpful in this test booklet  3.26    1.16    3.24    1.17

Differences related to the order in which students did the booklets were most obvious concerning having done one's best, with students completing Booklet 2 first reporting clearly more effort in that booklet than those for whom it was the second one (4.06 vs. 3.68, ANOVA F=118). However, students beginning with Booklet 2 might just have been more diligent, or used the Likert scale in a more up-beat mode, as their reported effort declined only 4 % from the first to the second booklet, compared to 7 % for those beginning with Booklet 1. Students beginning with Booklet 2 also found Booklet 1 easier than students for whom it was the first one (3.52 vs. 3.27, F=54), even if they found Booklet 2 even easier, contrary to the other group, who found their second booklet more difficult than the first one (3.65 vs. 3.36). Part of the differences reported above seem to be related to differences between the students from different countries, with French, Portuguese, and Spanish students either really having been more diligent in completing the tasks, or having a tendency to use the maximum value(s) of the Likert scale more often (the mean for students' answers to question 4 in both booklets was 4.24 for these three countries, compared to 3.81 for Austria, Finland and Italy, and 3.41 for Cyprus and Slovenia). Students in Cyprus and Slovenia also conveyed the least motivation in completing the booklets (mean 2.91), while differences between the other countries were relatively smaller (3.27-3.55), excepting the higher estimates of the Spanish students in both booklets (mean 3.83). Many students seem to have answered the two questions concerning the role of school and the world outside of school in helping them solve the tasks in the booklets as competing, not complementary; i.e., if it is school, it cannot be the world outside, and vice versa. Accordingly, the answers are not easy to interpret. In all countries but Spain (and, in Booklet 2, Portugal), students saw learning outside of school as having prepared them more for solving this kind of problems than school. Still, it might be too hasty to conclude that the result, or the between-country differences in students' views concerning Booklet 2 (in Booklet 1 there were no differences), would indicate education systems' ability to foster learning to learn. Rather, most probably the results just imply that students see the tasks to differ from those in textbooks, leading to the (false) conclusion that school has not given them much in terms of helping to solve them. And if it is not the school, it must be things learnt somewhere else 39. Of the approximately 4640 students (weighted data) answering the questionnaires at the end of the booklets, 1754 gave at least one reason for not finishing the tasks in time, a share that exceeds the attrition in any of the tasks, most often just by ticking the box "I do not know" (Table 42). As could be seen in Table 10, finding the time allocated to the test insufficient was most common among students beginning with Booklet 2, in that booklet.

39 The complementary interpretation is supported by the fact that the mean for the two questions is 3.1 (SD 1) for both booklets.


Table 42
Percentage of students giving different reasons for not finishing the booklet in time

                                                          Booklet 1   Booklet 2
I gave up half way because it was too difficult               4 %         4 %
There was not enough time to answer all the questions         6 %        11 %
I could not concentrate                                       7 %         7 %
I was not motivated                                           5 %         5 %
I do not know                                                11 %        11 %
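As the student counts above are weighted, the shares in Table 42 are presumably weighted percentages. A minimal sketch of that computation, with invented column names and data (one row per student, a sampling weight, and a 0/1 indicator per ticked reason):

    import pandas as pd

    # Invented example data; "weight" stands for the sampling weight.
    df = pd.DataFrame({
        "weight":          [1.2, 0.8, 1.0, 1.1, 0.9],
        "gave_up":         [0,   0,   1,   0,   0],
        "not_enough_time": [1,   0,   1,   0,   0],
        "dont_know":       [0,   1,   0,   0,   1],
    })

    total_weight = df["weight"].sum()
    for reason in ["gave_up", "not_enough_time", "dont_know"]:
        share = (df["weight"] * df[reason]).sum() / total_weight
        print(f"{reason}: {share:.0%}")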

Even if students were not overtly enthusiastic about the tasks, their evaluations of their interest in them fell above the midpoint of the scale for all tasks (Table 43). Girls found the questionnaires and the CCST task in Booklet 2 more interesting than boys, while boys found the Meta task slightly more interesting (in both cognitive tasks, the difference is significant only at level p<.01). There were also differences between the countries in students' expressed interest in the different tasks, with the Finnish and Slovenian students finding the tasks less interesting, and the Portuguese and Spanish students more interesting, than others. However, as these differences are not repeated in the results of the cognitive tasks, for example in attrition, but coincide with those in students' responses to the general questions concerning the test, they might rather reflect cultural differences in scale use.
Table 43
Students' expressed interest in the different tasks, with gender and between-country differences (ANOVA, F)

                      Mean     SD   Gender   Country
BOOKLET 1
  QUESTIONNAIRE       3.56   1.34       23        58
  CCST 1              3.52   1.28        -        34
  LAG SEV             3.25   1.38        -        27
  CHOCOLATE           3.11   1.41        -        39
BOOKLET 2
  QUESTIONNAIRE       3.46   1.37       25        64
  CCST 2              3.44   1.27       10        53
  TEXT                3.30   1.29        -        65
  META                3.16   1.36        7        33
  LAKES               3.14   1.45        -        45


Girls, who found the questionnaires more interesting, also found them easier to answer, even if boys, too, seem not to have had much difficulty in deciding on their opinions (Table 44). There were no gender differences in students' views concerning the difficulty of the learning to learn tasks, but boys, perhaps reflecting their slightly better performance in the task or just their more positive self-concept in mathematics, found the more school-math-like Meta task easier than girls. However, boys found the CCST task in Booklet 2 slightly more difficult than girls, again correctly reflecting the difference in performance.
Table 44
Students' opinion on the difficulty of the different tasks, with gender and between-country differences (ANOVA, F)

                      Mean     SD   Gender   Country
BOOKLET 1
  QUESTIONNAIRE       1.94   1.20       29         9
  CCST 1              2.39   1.22        -         5
  LAG SEV             2.89   1.37        -        23
  CHOCOLATE           3.09   1.42        -        23
BOOKLET 2
  QUESTIONNAIRE       2.10   1.29       11         6
  CCST 2              2.44   1.23        9         5
  TEXT                2.71   1.25        -         7
  META                3.00   1.31       25         9
  LAKES               3.20   1.44        -        20

Between-country differences, instead, are clearly smaller when students are assessing the difficulty of the tasks than when assessing their interest (all the marked differences are nevertheless significant at level p<.001). The Portuguese and Spanish students found the two mathematical tasks, LagSev and Meta, more difficult, while the French (especially in the case of LagSev), Austrian and Slovenian students found them easier. Students' views on the interest and difficulty of the tasks form in themselves relatively reliable scales (in the two booklets, α=.75/.84 and α=.64/.74 for interest and difficulty, respectively, with all item-total correlations >.50 for interest but lower for difficulty). Correlations between students' interest or difficulty assessments and their performance in the respective tasks, instead, are fairly low (r=.04-.15 for interest and r=.03-.15 for difficulty), while students' expressed interest in the questionnaires was more strongly related to their expressed opinions in the affective learning to learn questionnaires, especially with the subconstructs A1 and C1, and more strongly so for girls than for boys (r=.48 and r=.47 vs. r=.38 and r=.44).
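The reliability coefficients quoted above are Cronbach's alpha, and the item-total correlations are most naturally read as corrected item-total correlations (each item against the sum of the remaining items). A minimal sketch of both computations on simulated data; the item names and the single-factor data-generating model are invented for illustration:

    import numpy as np
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        # alpha = k/(k-1) * (1 - sum of item variances / variance of the sum)
        k = items.shape[1]
        return k / (k - 1) * (1 - items.var(ddof=1).sum()
                              / items.sum(axis=1).var(ddof=1))

    def corrected_item_total(items: pd.DataFrame) -> pd.Series:
        # Correlation of each item with the sum of the remaining items.
        return pd.Series(
            {c: items[c].corr(items.drop(columns=c).sum(axis=1)) for c in items}
        )

    rng = np.random.default_rng(0)
    latent = rng.normal(0, 1, 200)  # common factor behind the ratings
    items = pd.DataFrame(
        {f"item{i}": latent + rng.normal(0, 1, 200) for i in range(1, 5)}
    )
    print(f"alpha = {cronbach_alpha(items):.2f}")
    print(corrected_item_total(items).round(2))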

4 Learnings from the learning to learn pre-pilot test – recommendations for the next stage
The test built on the basis of the European learning to learn framework, and its pre-piloting in eight countries, has been an impressive achievement and contribution to the indicator work led by the European Commission under the Open Method of Coordination (European Council, 2000). The test was compiled from elements with different theoretical underpinnings, and some of the components were reduced in scope, due to time constraints in the testing situation at schools, in a way that may have cost them some of their essential characteristics. And still, a test was built within tight time constraints and tried out in eight European countries during the spring of 2008. The four research units providing parts of their instruments for the test, as well as the schools and students in the eight countries participating in the pre-pilot, have made an invaluable contribution to the endeavour. Without them there would have been no test and no data to analyse, and no basis on which to formulate recommendations for the further development of the future indicator for measuring learning to learn.

The objective of this report has been to examine the working of the current test in the eight-country pre-pilot, not to propose alternative or even reparative solutions for a new or better test. However, pointing out strengths and weaknesses in the current test, as revealed by the analyses presented in this report, naturally raises views on how the instrument could be further developed. The results give no reason to renounce the framework or its operationalisation into the cognitive, affective and metacognitive dimensions. Within each of these domains, however, the results of the pre-pilot test indicate needs for readjusting both the framework and its operationalisation into tasks and sets of affective scales. In this chapter, the conclusions presented originally at the end of each subchapter will be drawn together. After that, a list of steps will be proposed to bring the test to a level where it can be tested in a pilot, hopefully covering all the member states of the European Union.

In the cognitive domain, two kinds of problems were revealed by the pre-pilot. First, the fit between the four-dimension framework (identifying a proposition, using rules, testing rules and propositions, and using mental tools) and the tasks used for measuring them was shown to be problematic. Second, many of the tasks revealed problems in inner consistency or in level of difficulty regarding the sampled age cohort, or both (chapter 3.1). To remedy these shortcomings, the following steps are recommended:

- In the framework, a new dimension is recommended to be included, covering the domain measured by the CCST task. The recommendation is based both on the CCST items performing poorly in the constructs they are currently part of, and on their clearly measuring a dimension outside the range of the other tasks, namely applying reasoning in the realm of everyday life dilemmas. We feel that acknowledging this difference provides a possibility to extend the cognitive dimension of the framework and the instrument to comprise the kind of competence that students do not necessarily recognise as school-related but which would indicate willingness to tackle problems requiring a personal cognitive investment.

- Problems of reliability and of limited coverage of the competence range of the students partaking in the pre-pilot were true of all the tasks, at least to some degree. Many of the tasks shared the same problems: CCST (reliability low but for the high number of items; items of only low to mid-low difficulty), Choco (good reliability but poor distribution along the difficulty range), Text (low reliability, no items below the mid level of difficulty), and Lakes (lacking reliability, poor distribution along the difficulty range).


The same was true of the Meta task (low reliability, all items at the mid to low difficulty range). Only the LagSev task worked well psychometrically, but had too few items to provide a full-bodied scale for the fairly wide competence range of the students. Accordingly, the following recommendations have been made regarding the cognitive component of the test:

1. Retaining the basic model for the cognitive component, with the exception of removing the CCST items from their current constructs to form a new construct that could be called reasoning in everyday situations.

2. Revising the different tasks to a varying degree, aiming at tasks with a fairly equal number of items to balance the instrument between the five cognitive dimensions it measures.

The key steps to be taken in the different domains / tasks would be:

1. The CCST task
   a) Replacing poorly functioning items and adding new ones to form a coherent test with 10 items, covering the attainment distribution more evenly.
   b) Dispensing with the introductory stimulus to the items.

2. The LagSev task
   a) Increasing the number of items to 10 to cover the full distribution of student attainment more evenly, and to facilitate comparisons with the competencies measured by the other constructs.

3. The Choco task
   a) Revising the item structure of the task to cover more evenly the cognitive object of the task, i.e., students' understanding of changing variables in comparisons.
   b) Resetting the task to allow a vocabulary more familiar to students, and renewing the layout of the task.


   c) Ensuring that the revised test covers the range of student attainment more evenly, offering items at different levels of difficulty.

4. The Text task
   a) Transferring the basic structure of the task to a shorter expository text.
   b) Reducing the number of items to cohere with those of the other tasks, while still allowing for the variation of the three levels of propositions.

5. The Lakes task
   a) The task is recommended to be changed into a new problem-solving task, truly combining the dimensions covered in the other tasks and requiring students to choose and use information from a stimulus comprising text as well as graphs and tables.

The recommendations would lead to the cognitive component of the test comprising somewhat fewer items than the pre-pilot test (50 vs. 65) but covering the difficulty distribution more evenly across the different subdomains.

Also in the affective domain, two kinds of problems were revealed: first, the fit between the three subconstructs of the framework (learning motivation and strategies, academic self-concept, and learning environment, for short) and the scales used for measuring them, and second, the reliability of many of the scales themselves (chapter 3.2).

Regarding the framework, a discussion should be opened concerning the exact nature of, or expectations for, the three constructs. In all, some of the subdomains or scales pose problems regarding the supposed core of the construct. Regarding the inner coherence of the scales, cutting the length of the FILLS scales to just two items each was clearly a mistake, while many of the longer ELLI scales come close to acceptable reliability simply due to their length. To remedy this, a general recommendation is made to extend all the affective scales to comprise four items, at least at the pilot stage. If scales are found that would allow going back to just three items each, the decision should be based on wider European data than was now the case.


In learning motivation, learning strategies and orientation toward change, the main question is the differing nature of the supportive and the detrimental affective components, aggravated in the present instrument by the poor inner coherence of most of the negative scales. Also, some of the scales currently classified in the third construct, learning environment, might more rightfully belong in this construct. Furthermore, some of the scales included in the learning motivation construct clearly measure students' views on their own ability, impeding the applicability of an indicator relying on the construct to students of varying ability levels, while the scale measuring self-esteem in the latter is clearly not on a par with the scales measuring academic self-concept.

The results within the third construct, learning environment, evoke a more fundamental question concerning the objective of including the construct in the test. Is the objective to measure students' views of themselves in the different environments, or to try to measure the environments as such (as experienced by the students, to be sure), in order to draw conclusions regarding the kind of contextual features that promote the formation of learning to learn? As this is not clear, the recommendations made in chapter 3.2.4 might rest on false premises. Considering the problems met in the pre-pilot, the following recommendations have been made regarding the affective constructs:

1. The construct learning motivation, learning strategies and orientation toward change is recommended to be divided into two separate subconstructs, representing the positive and the negative dimensions affecting students' willingness to accept and undertake new challenges in learning and (cognitive) development. The number of scales can be reduced to about 10 and 5, respectively, moving the scale task orientation from the current learning environment construct into this construct. Most of the scales need reworking to increase their inner coherence.


2. The latter dimension of the construct self-concept and self-esteem is recommended to be enlarged to cover subjective well-being more widely or, if that is not deemed necessary, the one scale for self-esteem is recommended to be dropped from the construct, as it clearly does not form a coherent part of self-concept in the way the latter is now measured by the other items. However, the role of this closely school-feedback-related dimension in the framework for learning to learn might also merit discussion.

3. If the character of the learning environment construct is to provide feedback in the sense of revealing contextual factors that, when contrasted with the other data, support the formation of learning to learn, the subscales and items in the construct should be revised from that point of view, omitting those that rather measure personal traits. However, even if the objective is to help draw a profile of a good learner to learn in different contexts, it could be said that the present scales offer only a very narrow range of different types of learners to learn the possibility to recognise themselves in the questions.

In the metacognitive domain, the monitoring task clearly seemed to offer no extra information concerning metacognition compared to the other cognitive tasks. Also, the two questions used for measuring metacognitive accuracy and confidence are conditional on each other in a way that prevents their full use as two independent dimensions. Accordingly, further work is needed to establish the metacognitive component as a fully independent dimension in the instrument, alongside the cognitive and the affective dimensions. One way to achieve this would be to bring some of the questions now set at the end of the booklet to the end of each cognitive task (interest, difficulty, novelty, etc.). In this way, students would be encouraged to reflect constantly on their work, while this moment of reflection might also help them attend to the novel qualities of each task in a more concentrated manner.
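The wording of the two pre-pilot questions is not restated here, but one standard way to treat accuracy and confidence as separate indices is to score confidence as the signed difference, and accuracy as the absolute difference, between a student's self-estimate and the actual result. A hedged sketch on invented data (the 0-10 scale and the variable names are assumptions, not the pre-pilot's coding):

    import numpy as np

    rng = np.random.default_rng(2)
    # Invented data: each student's predicted and actual score on a task.
    actual = rng.integers(0, 11, 300)
    predicted = np.clip(actual + rng.integers(-3, 4, 300), 0, 10)

    bias = (predicted - actual).mean()            # > 0 indicates overconfidence
    accuracy = np.abs(predicted - actual).mean()  # 0 means perfect calibration

    print(f"confidence bias = {bias:.2f}, absolute accuracy gap = {accuracy:.2f}")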


*

The pre-pilot has offered a possibility both to analyse the functioning of the test in a content-validity and psychometric sense, and to reflect on the relationship between the test and the framework it relies on. The conclusions made and the recommendations proposed have to be understood as based on a sincere wish to help the further development of the test, which still has to accommodate the different paradigmatic premises of the tests it is based on. The ideal of psychometric rigour, more fundamental to the cognitive-psychological than to the socio-cultural tradition, has acted as a guiding principle in the analyses and in the criticism directed at the poor working of certain constructs. This has been done deliberately, however, with the further goal of being able to clearly identify and name factors that can be seen to be related to learning to learn competence, in order, in turn, to be able to translate these findings into recommendations for teachers and schools on fostering the development of these competences and attitudes in their students. All in all, we hope that the report will help to open the next stage in the discussion toward a common European indicator for learning to learn, begun already in the 1990s, leading to a wider pilot in the near future, with a revised test based on the current instrument.

