
Beyond Control of Variables: Introducing Academically Disadvantaged Young Science Students to Practices of Science

Deanna Kuhn and Toi Sin Arvidsson

Teachers College, Columbia University

Author contact: dk100@tc.columbia.edu

Beyond Control of Variables: Introducing Academically Underserved Young Science Students to Practices of Science

Abstract

The broad objective of the present work was to successfully engage low-achieving, underserved middle-school students in activities introducing them to practices of science. These extended beyond the typical univariable control-of-variables strategy to include the attribution and prediction involved in coordination of multiple variables affecting an outcome, as well as practices of argument and counterargument in advancing and challenging claims. Social science content was used as a way to help students see the purpose and value of scientific practices. The broad objective was largely met, as evidenced by near and far assessments of transfer and maintenance in two 6th-grade classes (N=49), both outperforming a control group (N=23). Although students engaged successfully in argument and counterargument as part of the intervention activities, far transfer to meta-level reasoning about argumentation and the nature of science was less successful. Finally, and importantly in its practical implications, one group of intervention students showed less gain in ten 45-min classroom sessions than did the other group, who engaged in pairs in the same sequence of activities over an average of six 24-min individualized sessions.

The widening gulf between the advantaged and disadvantaged segments of American society is a topic of increasing concern, especially to the extent
that the education of the least advantaged students is so unproductive as to
deny them opportunity for mobility. High-quality education for all students
across the socioeconomic spectrum enjoys wide acceptance as an objective, yet
views diverge regarding the best way to achieve this objective. Approaches to
literacy goals, for example, range from extensive practice comprehending
decontextualized text passages to raise test scores to approaches that promote
literacy as a cultural practice (Applebee, 1996).
In the domain of science, the challenge of identifying the best approach
is even greater due to less certainty regarding the objectives of early science
education. Science educators are aware that beginning students can be
introduced to at most a smattering of contemporary scientific knowledge, and
many have come to the view now explicit in the new Next Generation Science
Standards (NGSS) that a key objective of science education should be to
acquaint students with science as a practice (Ford, 2012; Kelly, 2008; Lehrer &
Schauble, 2006, 2015; Manz, 2014; Osborne, 2014; Sandoval, 2014). Science
practice encompasses a range of activities that include posing questions,
developing hypotheses, designing and conducting experiments, examining and
interpreting data, and debating conclusions. It is believed most effective if
students engage in these activities for themselves, but the best ways to involve
them in such activities are far from established. The activities are more than
simply procedures and invoke ways and standards of knowing, e.g., what counts
as effective evidence and argument (Sandoval, 2005). Thus, to become
educated in the practice of science, students need to engage in communities of
scientific inquiry in which they develop shared objectives, methods, norms, and,
ultimately, it is hoped, values. In so doing they become participants in shared
practices that are part of larger disciplinary norms and traditions they come to
recognize and appreciate (Applebee, 1996; Forman & Ford, 2014; Lehrer &
Schauble, 2015; Sandoval, 2015).
These objectives of science education are a tall order to implement for
any students, let alone academically underperforming, disadvantaged ones.
Rudolph (2014) notes that such objectives in fact go back as far as Dewey and
that the problem lies not with these educational goals but with schools' seeming
inability to devise workable school experiences to achieve them. What appear
to be the most promising methods have yet to be widely implemented in
classrooms.
Adding to the challenge is the restricted way in which scientific practice
has been defined, both in the K-12 classroom and in educational research. In
both cases science practice typically has been taken to mean use of the
scientific method, which in turn has been regarded as the design and analysis of
a controlled experiment. Moreover, the experiment is a univariable one, its
essence being the control (by holding constant) of variables (COV) in order to
identify the effect of a single variable on an outcome. This univariable model of a rudimentary science experiment is not only the staple of classroom introductions to the scientific method; mastering COV has similarly, until recently, been a primary focus of research on the development of scientific thinking (for
reviews see Kuhn, 2011; Lehrer & Schauble, 2006, 2015; Zimmerman, 2007).
In the real world, in contrast, outcomes are most often the consequence
not of a single cause but of multiple factors acting in concert, a fact that
practicing scientists are well aware of and take into account in both their
theoretical models and empirical investigations. The univariable logic and
execution of COV represents at most one narrow slice of authentic scientific
inquiry, and the most recent writing on developing children's competency in
science emphasizes involving students not in acquisition of a tool kit of discrete
skills such as COV but rather in the practice of science as an authentic,
integrated whole (Lehrer & Schauble, 2015).
This is the approach we have sought to implement in the work presented
here. Moreover, we seek to establish that the objective of introducing beginning
students to science as a multifaceted, comprehensive practice is possible even among the most challenging populations, where our work has been concentrated: underserved, underachieving, largely disengaged urban middle-school students. A traditional approach to getting young students interested in science has been to show them surprising phenomena (miniature erupting volcanoes and the like) that elicit an "Oh, wow!" response but don't lead much of anywhere.
In contrast, our approach is focused on getting students to experience
that the methods of science have purpose that makes sense to them and hence
are of value. Most important to the approach implemented here, then, is that
students' activity be situated in the context of what students will see as a meaningful purpose and goal. In two previous studies (Jewett & Kuhn, 2015; Kuhn, Ramsey, & Arvidsson, 2015), urban, largely underperforming middle-school students addressed, for example, the topic of juvenile delinquency,
among other social science topics. We told them that social scientists have been
studying the matter for some time and data are available on teen crime rates
and on potentially predictive factors such as education, family income, and
population density; students were asked to help a foundation identify factors
that predicted teen crime rate so that these factors could be targeted in efforts
to reduce teen crime.
Our reason for employing social science topics such as teen crime goes
beyond the fact that such topics are ones our student population is familiar
with. Although students will feel they already know something about them, they
likely will not know that such topics are the stuff of science. What better way,
then, to get them to see its power and relevance? In the course of such
activities, students come to see how their (and others') beliefs about a
phenomenon like teen crime are subject to influence by means of application of
a scientific method. If students make this discovery for just a single topic, it
likely will occur to them that the same could be true for other topics, leading
them to broader appreciation of scientific practice.

In the present study, we worked with students from a similar population over an extended period and hence were able to engage them in a way that
captured and integrated multiple strands of scientific practice. We included COV
but went beyond it, making it clear to students that multiple factors were likely
contributors to the outcomes of concern and hence needed to be examined,
taken into account, and their effects coordinated. A data analysis tool for K-12
students, InspireData, was integral to this objective as it allows students to
visually represent the effects of multiple factors. Students could then use the
multivariable understanding they gained to predict outcomes based on
evidence across multiple variables and thereby achieve the larger goal of the
activity: drawing on evidence, rather than only their own beliefs, as a source in
seeking to understand the relationships being examined. Finally, we engaged
them in scientific writing in the form of reports to the sponsoring foundation
regarding their findings. This activity included addressing challenges to their
claims, thus exercising skills of both argument and counterargument.
Our pedagogical method can be characterized as one of guided inquiry.
The phases of the investigation were segmented for students into a sequence of
component tasks, with care taken to make clear the purpose and goal of each one and its place within the larger task. For example, with respect to COV,
research has shown that students do better if they are guided to pose for
investigation an appropriate question regarding the role of one variable at a
time (Kuhn & Dean, 2005; Lazonder & Kamp, 2012), progressing through the
variables in sequence. Students are not given direct instruction as to strategies to apply to the component tasks; rather, attention is focused on the task goal and on their coming to recognize that the inferior strategies they use fail to achieve this goal (Kuhn & Pease, 2008, 2009; Lorch, Lorch, Freer et al.,
2014). As a culminating activity, they reflect on how their final conclusions differ
from their initially solicited beliefs about the roles of each of the factors. Doing
so leads to reflection on the task as a whole and on how their evidence-based
conclusions provide the knowledge necessary to achieve the best task solution.

Participants in the present work come from the low-SES, low-achieving middle-school population in which several researchers have reported difficulty in developing rudimentary scientific thinking skills, compared to success in doing so among more privileged groups (Kuhn & Dean, 2008; Siler, Klahr et al., 2010;
Lorch, Lorch, Calderhead et al., 2010). Some of these low-performing students,
for example, in seeking to design an experiment fail to manipulate the focal
variable. This failure can be attributed at least in part to absence of more basic
understanding of the purpose of scientific investigation as a) seeking to answer
questions whose answers are not already known, and b) engaging in causal
analysis, rather than seeking only to optimize outcomes (Schauble, Klopfer, &
Raghavan, 1991).
Furthermore, it is widely observed that students in such populations
typically show little interest in or disposition to study science. This population,
then, seemed to us an especially important one to reach and achieve success
with, remaining mindful of the practicality of the methods examined for large-scale classroom use. We therefore included here a comparison of two parallel
methods, identical except that one is administered to pairs of students by a
researcher while the other is administered to a whole class by the classroom
teacher (with some researcher assistance).

Method
Participants
The participants were from a public school in a low-income neighborhood
in the Bronx in New York City. There were a total of 72 students (38 females)
from three sixth-grade and one seventh-grade science classes, all taught by the
same teacher. Participating in pairs in a pair intervention condition were 25
students (12 females) from one sixth-grade classroom. Because of the odd
number of students in this group, one student participated in the intervention
alone. A second sixth-grade class of 24 students (12 females) participated in the same intervention as a whole class. Twenty-three students (14 females)
drawn equally from another sixth-grade class and a seventh-grade class served
as a control group and received only the post-intervention assessments.
The school population is 1% Asian, 19% African American, 79% Hispanic, and 1% Caucasian. The majority of the students are from academically and socioeconomically disadvantaged backgrounds, with about 72% qualifying for free or reduced-price lunch. Only about 15% of students performed at grade level on recent ELA and Math standardized proficiency tests.
A 10-item written multiple-choice test (Jewett & Kuhn, 2015) of a type
commonly used for this purpose and administered at the beginning of the
school year confirmed that students showed little mastery of the control of
variables (COV) strategy, with a majority of students scoring no more than 50%
correct and scores of 100% correct rare, a finding consistent with others for this
population. Group comparisons were nonsignificant.
Intervention procedure
The content of the intervention was identical across conditions except
that the classroom group participated as a whole class led by the classroom
teacher and assisted by the second author (whose presence enabled her to
confirm fidelity of implementation) and an assistant, while in the pair condition each pair worked in a corner of the classroom with a facilitator (the second author)
throughout the intervention. In the classroom condition, students worked with a
partner for most activities.
The intervention was administered to the classroom group over a sequence of ten 45-min class sessions that took place over a period of 16 days. The number of sessions in the pair condition varied as it was tailored to students' progress. Sessions averaged about 24 minutes (allowing two per class period). Of 13 pairs, one completed the intervention in four sessions and one completed it in five, while most of the rest took six sessions and four pairs required seven sessions. These sessions took place over an average of 32 days, with a range from 14 to 59.
At the first session, the activity was introduced as follows, illustrated by
an accompanying PowerPoint graphic:
A new Astro-World Foundation, funded by some wealthy businessmen,
wants to provide money for a space station. Groups of young people would
live there for several months. Many young people have applied. The
Foundation president needs to choose the best ones. So she asked some
applicants to spend a week in a space simulator (picture is shown and
function explained). She had background information about each
applicant, and each one got a rating on how well they survived in the
harsh conditions of the simulator. Some did fine; others okay, and some
became sick and had to leave.
Based on these records, she can decide which things are important to ask new applicants about and which ones aren't. Some of the factors, she noticed, made a big difference to how well an applicant did, some made a small difference, and some made no difference. She found out, for example, that body weight made no difference: Heavy people did as well in the simulator as light ones. But other things about people seemed to make a big difference in how well they did. So now, when she chooses final groups of astronauts to go on the real trips, she'll have a better idea what things to find out about applicants, so she can be pretty sure how an applicant will do and she'll be able to choose the ones who will do best. But, in order to be sure, she's asked for our help in analyzing their results. Which things are worth asking applicants about and which don't make any difference, like body weight? There are a lot of things that we can ask about but the foundation can't ask about everything. It would take too

long. If we know what to ask applicants, we can choose the best team of
astronauts.
Here are four things that the foundation thought might make a difference to how well people do in the simulator: Fitness (does how well the person can run or do other exercises matter?); Family size (does the size of the family the person grew up in matter?); Education (does how much education a person has matter?); and Parents' health (does the health of the person's parents matter?). All the applicants seem healthy, but maybe their parents' health might say something about how healthy they will turn out to be.
Will you help figure out which things are worth asking the applicants about and which ones don't matter? Then you can predict how well they'll do and choose the best ones for the team. Later, you can compare your results with those of your classmates and see who chose the best-performing astronaut team.
Following this introduction, pairs were asked to record on a form which of the
four factors they thought would and would not matter. In the classroom
condition, a tally across the class was shown, and in both groups it was noted
that opinions differ.
Control of variables phase. This phase was introduced by the adult (teacher or facilitator, depending on condition) saying, "These are only opinions and what someone thinks. Now, let's look at the data to find out what actually does matter and whether your hypotheses were right." A general reminder of the larger purpose of the activity was then provided and was repeated periodically throughout the activity (a minimum of once per session): "Remember, the goal is to figure out what matters to how well people do in the simulator. Why do we want to know that? Because once we know what matters, we can predict how well people will do. That way, we can pick the best team."

Students were then shown a set of 24 index cards, each containing an applicant's standing on the four factors and a space to record the applicant's performance rating in the simulator. Students were told that if they studied the records carefully they could determine which factors make a difference to performance and which don't. Students shared the set of cards with their
partner and were reminded that they needed to agree before making decisions
or drawing final conclusions.
The adult suggested studying one factor at a time. Students chose the
order of investigation in the pair condition; in the classroom condition it was
fixed (fitness, education, family size). Students were invited to choose from the
set the card(s) they would like to look at and to explain verbally (pair condition)
or in writing on a form for this purpose (classroom condition) what they would
find out from this choice of card(s). They then requested verbally (pair
condition) or in writing on a Data Request form (classroom condition) the
performance outcomes for their chosen applicants.
After recording these outcomes, pairs then had the option either to reach
a conclusion or to postpone concluding and seek further evidence by repeating
the preceding process. Once a pair was certain they had reached a conclusion
about a factor's status, they could enter it on a Draft Memo form. In the pair condition, if the conclusion was a valid one based on a controlled comparison, the pair proceeded to choosing another factor to examine. If no controlled comparison existed allowing a valid inference, the adult embarked on a sequence of probe questions (see Appendix) whose purpose was to support recognition of the limitation of the students' investigative approach in not yielding a definitive conclusion (e.g., "Couldn't it also be the difference in education that's leading to the different outcomes?"). No superior approaches were suggested, and scaffolding did not go beyond highlighting failure to achieve the goal (a definitive conclusion). In the case of valid conclusions, challenging probes were introduced (see Appendix), e.g., "Suppose someone disagrees with you and doesn't think that this factor makes a difference; what could you tell them to convince them?"

In the classroom condition, this probing could not be conducted individually, but once per class session (typically at the end of the session) a whole-class discussion occurred that followed this model, using one pair's work as an example.
Once a pair in the pair condition and the majority of pairs in the classroom condition had achieved three controlled comparisons (showing fitness, a two-level factor, and education, the only three-level factor, to be effective, and family size, a two-level factor, to be ineffective), pairs completed their final memo to the foundation director, indicating which factors applicants should and should not be asked about and justifying their recommendations with results from their investigations.
Multivariable coordination phase. Students at this point were ready to transition to the next phase, in which they represented and reasoned about the influence of all of the factors operating at once. Students were introduced to this phase as follows: "So far all of our conclusions have been based on comparing just two or maybe three cases. We would be more sure of our conclusions if we looked at more than two cases at a time. We have a way to do that." Students were then introduced to the representation of their data using charts generated by the program InspireData and told, "All of the cases that you and your classmates have looked at before are here." It was explained that each diamond represents a case and that the identity of that case can be seen by hovering over the diamond (Figure 1).
Figure 1. InspireData Chart Showing All Cases


It was then illustrated that charts can be generated that separate cases into different categories; for example, in the display shown (Figure 2), only those cases in which the applicant's fitness was average rather than excellent are shown. Students were then asked why it was that these applicants, all of the same fitness level, showed a range of performance outcomes. With a little
prompting, students were able to generate the response that other factors
besides fitness were contributing to the outcomes.
Students were then shown a third display (Figure 3) in which all levels of
the fitness variable are included. They were asked to draw conclusions about
whether the factor makes a difference to the applicants performance. Given the
ability to see more data at once, students were asked to see if they reached the
same conclusions as they did earlier when comparing individual cases
presented on cards.
Students were then provided InspireData charts for each of five factors,
four introduced previously and one new one (home climate, a non-effective
factor), each of the same form as Figure 3, showing outcomes for all levels of
the factor. Students were reminded these charts would give them an opportunity to verify their earlier conclusions. In their pairs, they did this and
then wrote memos to the foundation director confirming their earlier
conclusions based on a larger sample or revising their conclusions if they
thought necessary.
Figure 2. InspireData Chart Showing only Cases with Average Level of Fitness

Figure 3. InspireData Chart for the Fitness Factor


Prediction phase. Students were told that now that they had reached final
conclusions, they could try using them to evaluate a new set of applicants. They
would then be able to select a set of five to be chosen for the astronaut
program and compare their choices to those of their classmates. Students were
told that they could select up to four factors about the new applicants that they
could receive information on. As students were selecting the factors, the adult
reminded them to review the InspireData chart and consider whether
knowledge of status on this factor would be informative as to outcome. In the
classroom condition, a similar process took place at the whole-class level.
Information about 10 new applicants on four factors (including one non-effective one, whether or not it was asked for) was then presented, with data for each applicant appearing on a separate card and cards presented one at a time. Students completed the first prediction with guidance and then worked independently. In addition to making each prediction, they were asked for each one, "Which of the four factors you have data on mattered to your prediction?" Students were encouraged to review the InspireData charts to double-check their decisions or when there was disagreement within the student pair.


A final discussion occurred when pairs made their selections of the five
top-rated candidates and, in the classroom condition, shared these with the
whole class. This discussion included recalling the beliefs they had initially held about the factors and noting that the applicants they would have chosen before the analysis differed from those they chose after conducting it.
Progression through the sequence. Students in the pair condition
progressed at their own pace, with a judgment made by the facilitator as to
when both members of the pair had a solid grasp of COV and appropriately
justified and defended their claims and were thus ready to move to the
multivariable phase; a similar judgment was made regarding progression from
the multivariable to prediction phases. Although this close monitoring was not
possible in the classroom condition, a paper-and-pencil assessment was
administered after the 8th class session to ascertain individual progress to this
point. The task required students to select a case to compare to one presented
to them, in order to determine whether one variable (fitness) made a difference.
Post-intervention assessment
All post-intervention assessments except one were conducted individually
and all were delayed in order to assess maintenance of achievements. Among
students in the pair condition these assessments occurred an average of 26
days following completion of the intervention (range 14 to 42 days). Among
students in the classroom condition they occurred an average of 32 days
following completion of the intervention (range 18 to 46 days). Assessments for
students in the control condition occurred during this same time period.
Maintenance and near transfer of skills in experimental design and control
of variables (COV). As the first component of the delayed post-intervention
assessment, a task related to the intervention activity was individually
administered to students in both intervention conditions. It was not
administered to students in the control condition as they were not familiar with
the intervention content. Its purpose was to assess the extent to which
experimental design and inference skills were maintained and transferred to similar content. Two items were included, each introducing a new variable
(height and strength) into the astronaut scenario. The student was asked to
select up to two cases to test its effect.
Far transfer of design and COV skills. The far transfer task consisted of
three written items involving unfamiliar content but assessing the same skills
(design and interpretation of a controlled experiment) as the near transfer task.
One item, for example, stated that New York City was designing new cars for
their subway. Four pairs of two cars each were displayed, varying in car size and
number of wheels. The student was asked to select the best pair to build and test in order to find out whether car size makes a difference to how fast the train goes. This task was administered to students in all three conditions.
Multivariable analysis and prediction. This component, administered to all
students, was adapted from the one administered to adults by Kuhn, Ramsey,
and Arvidsson (2015). It was intended as a delayed transfer task to assess
maintenance and transfer to new content of the multivariable analysis and
prediction skills introduced in the intervention. The task presents (authentic
although simplified) data about factors (Employment, Family size, Education,
Home Climate) found to have an effect on average life expectancy across
different countries and one non-contributing factor (Country size). The student
is asked to predict life expectancy of additional countries based on information
provided about their status on the identified factors. The task also asks
respondents to indicate which factors they considered in their prediction. To
ensure students understood the task, they were given three practice items
before the six items that constituted the task.
Argumentation. The final component of the delayed post-intervention
assessment, administered to all students, is an elaboration of the cancer task
used by Kuhn et al. (2015). It consists of three parts, the first of which asks the
student to formulate a research question, while the second and third pertain to
counterargument and reconciling divergent conclusions. All are far transfer tasks as they bear no surface similarity to and make no reference to the intervention content.
1. Formulating a research question. Students were told, "The Public Health Department of Portland, Ohio has noticed that the percentage of residents diagnosed with cancer is much higher in the inner city than in the outlying neighborhoods. The department is undertaking a study to find out why there are more people getting sick with cancer in the inner city than the outlying area. You have been assigned the job." The student
was asked what they would do to investigate the matter. Our interest here
was limited to the structuring of the research question rather than to any
procedural details. Of particular interest, given the multivariable emphasis
within the intervention, was whether a student would plan to investigate
only a single potential causal variable or would recognize multiple
potential contributors.
2. Evidence and counterargument. The second of the three tasks
was a continuation of the first one and presented immediately thereafter.
The student was told, "John thinks it's because people in the city go to tanning salons. What might someone say to John, if they think that John was wrong? Choose the best evidence to show that John was wrong."
Presented verbally and as a written list were these four response options:
A. Air pollution is a more likely cause of cancer in the city
B. Many people outside the city also go to tanning salons and don't
get cancer
C. Many people who don't go to tanning salons also get cancer
D. There are more tanning salons outside the city than in the city.
Option B is the best answer because it refers to the presence of an
alleged causal antecedent in conjunction with absence of the alleged
outcome.1 Option C is consistent with alternative causes that may also
produce the outcome, as is Option A in more explicit form. (Option D is
not directly relevant to the claim.)


3. Reconciling claims. The final task continued the topic theme but
was presented in writing in the classroom on an occasion about one and a
half months after the other posttests, reducing possible influence of
students' particular responses to the previous two tasks having the same
theme. Two potentially conflicting causal claims are now explicitly
presented and the question asks the participant how to interpret this
discrepancy:
You were hired by the Health Department to find out why people
living in the city of Logan, Georgia are getting cancer more often
than people who live outside the city. You tested and found out that
air pollution was worse inside the city than outside. You wrote a
report of your findings to the Health Department director, telling her
that air pollution was a likely cause of the increase in cancer.
She also got a report from another person she hired. This report said
that a likely cause of the cancer increase was not enough stores in
the city for people to buy healthy fruit and vegetables that lower risk
of cancer.
The director isn't sure what to conclude and she has written you
asking for advice. What would you write back? Give her the best
advice you can.
This question, then, unlike earlier ones, solicits reasoning not about the
claims themselves but rather meta-level reasoning about their status in
relation to one another and how the discrepancy between them is to be
understood, a form of reasoning that is epistemological in nature and
central to scientific practice.

Results
Designing experiments and making inferences


Progress during intervention. Progress in varying the focal variable and making controlled comparisons showed different patterns of change across
participants within the pair condition, with some achieving these benchmarks
sooner than others. The norm, however, as observed in earlier microgenetic
research (Kuhn, 1995; Siegler & Crowley, 1991) is a period of vacillation in use
of lower- and higher-level strategies after the higher-level strategy first appears.
Individual progress could not be tracked with precision in the classroom
condition, but the interim paper-and-pencil assessment administered after the
8th class session confirmed substantial progress with respect to COV. The task
required students to select a case to compare to one presented to them, in
order to determine whether one variable (fitness) made a difference. Of 22
students in the classroom condition who took the test, 5 (23%) failed to vary the
focal variable, 8 (36%) varied the focal variable but showed no or partial control
of the remaining variables, and 9 (41%) chose a case that allowed a controlled
comparison.
Final post-intervention achievement, maintenance and near transfer. The
first component of the delayed post-intervention assessment was similar to the
interim assessment except that students were to produce the records on their
own. It was administered individually to students in both intervention conditions
to assess the extent to which they maintained and transferred experimental
design and inference skills (and specifically COV). The two items each introduced a new variable (height and then strength) and asked the student to
select up to two cases to test its effect. Performance is summarized in Table 1.
Table 1. Comparison of Intervention Groups on Maintenance and Near Transfer of Design and Inference Skills

                                                       Pair condition    Classroom condition
Never varied focal variable                            1 (4%)            3 (13%)
Varied focal variable only sometimes                   2 (8%)            3 (12%)
Consistently varied focal variable but inconsistent
  control of other variables                           2 (8%)            6 (25%)
Consistent controlled comparison                       20 (80%)          12 (50%)

Note. N=25 for the pair condition and 24 for the classroom condition.
Assigning four points for a controlled comparison, three points for varying a focal variable, and two points for including two cases for comparison, the average score across the two tasks was 3.68 (SD=0.73) for the pair group and 3.23 (SD=0.9) for the classroom group. This difference is not statistically significant in an independent two-tailed t-test, t=1.83, p=.074. It is, however, significant by the Wilcoxon-Mann-Whitney (WMW) test, a rank-based test that requires no distributional or random sampling assumptions, p = .042. Also, the proportion of students showing consistent controlled comparison is significantly different across groups, p = .038, Fisher's Exact test. For both groups, these proportions also differ significantly from the proportions of students showing consistent controlled comparison (i.e., 100% of items correct) on the beginning-of-year assessment (Fisher's Exact test).
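To make the analytic procedure concrete, the sketch below shows how such a group comparison could be run with standard tools. It is a minimal illustration only: the per-student score arrays are hypothetical placeholders rather than the study's raw data, although the 2x2 table for Fisher's exact test uses the counts reported in Table 1 (20 of 25 versus 12 of 24 students showing consistent controlled comparison).

```python
# Illustrative sketch only: the per-student scores below are hypothetical placeholders.
import numpy as np
from scipy import stats

# Hypothetical per-student design scores on the 2-4 point scale described above
pair_scores = np.array([4, 4, 3.5, 4, 3, 4, 4, 2.5, 4, 3.5,
                        4, 4, 3, 4, 3.5, 4, 4, 2, 4, 4,
                        3.5, 4, 4, 3, 4])                     # N = 25
class_scores = np.array([3.5, 2.5, 4, 3, 2, 4, 3.5, 2.5, 3, 4,
                         2, 3.5, 4, 3, 2.5, 4, 3, 3.5, 2, 4,
                         3, 3.5, 2.5, 4])                     # N = 24

# Independent two-tailed t-test (assumes approximately normal score distributions)
t_stat, t_p = stats.ttest_ind(pair_scores, class_scores)

# Wilcoxon-Mann-Whitney test: rank-based, no distributional assumption
u_stat, u_p = stats.mannwhitneyu(pair_scores, class_scores, alternative="two-sided")

# Fisher's exact test on the proportions showing consistent controlled comparison
# (counts taken from Table 1: 20/25 pair vs. 12/24 classroom)
contingency = [[20, 25 - 20], [12, 24 - 12]]
odds_ratio, fisher_p = stats.fisher_exact(contingency)

print(f"t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"WMW:    U = {u_stat:.1f}, p = {u_p:.3f}")
print(f"Fisher's exact (controlled comparison): p = {fisher_p:.3f}")
```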
Comparison of performance of the classroom group relative to their earlier
performance on the interim assessment indicates some continued progress
(reflected in fewer participants never varying the focal variable and more
showing consistent controlled comparison), although the switch from whole-class to individual assessment is a likely contributing factor. In terms of final
achievement levels, the large majority of both groups consistently designed
comparisons that varied the focal variable. The majority of those in the pair
condition also consistently designed controlled comparisons, as did half of those
in the classroom condition, with the remainder doing so only inconsistently.
These proportions also differ significantly from performance on the beginning-of-year assessment for both groups (Fisher's Exact test).
Far transfer. The far transfer task, consisting of three written items
involving new content, assessed the same skills as the near transfer task.
Performance by condition is summarized in Table 2.


Table 2. Comparison of Groups on Far Transfer of Design and Inference Skills

                                               Pair          Classroom     Control
                                               condition     condition     condition
Did not consistently vary focal variable       3 (12%)       13 (54%)      14 (61%)
Consistently varied focal variable but
  inconsistent control of other variables      7 (28%)       4 (17%)       7 (30%)
Consistent controlled comparison               15 (60%)      7 (29%)       2 (9%)

Note. N=25 for the pair condition, 24 for the classroom condition and 23 for the control condition.
Thus, comparing Tables 1 and 2, to an approximately equal extent across the two intervention conditions, fewer participants maintained controlled comparison consistently when content was new and represented only in a traditional paper-and-pencil format. Again assigning 4 points for a controlled comparison and 3 points for varying a focal variable, the mean score for the pair group was 3.71 (SD=0.51), comparable to the near transfer mean of 3.68 reported above. For the classroom group the mean was 3.10 (SD=0.78), slightly below the near transfer mean of 3.23. For the control group, the mean was 2.99 (SD=0.63). The difference was significant between the pair condition and the classroom condition, t = 3.23, df = 39.59, p = .003, and between the pair condition and the control group, t = 4.32, df = 42.43, p < .001, but not between the classroom condition and the control group, t = 0.54, df = 43.85, p = .591.
Multivariable analysis and prediction


Students overall did well on this task, indicating they understood the task
and were capable of performing it, yet there were significant differences in
performance across groups. In Table 3 appear mean prediction error scores over
the six items by group and the number of students in each group who showed
modal performance of zero error (a correct prediction). Of remaining students
who did not attain a modal performance level of zero error, all but one student
showed a modal level of 1, with the remaining student showing a modal level of
2. The maximum error score for each item was 2.
Table 3. Mean Prediction Error Scores and Modal Frequencies by Group

                                               Pair          Classroom     Control
                                               condition     condition     condition
Prediction error mean (SD)                     0.16 (.23)    0.42 (.39)    0.67 (.28)
Students showing modal performance
  of zero error                                19 (83%)      15 (63%)      5 (22%)

Note. Standard deviations in parentheses. Maximum prediction error = 2. N is reduced to 23 in the pair condition due to an administrative error for two participants; N=24 in the classroom condition and N=23 in the control condition.
The pair condition showed significantly less error than the classroom
condition, t = 2.78, df = 37.70, p = .009. The classroom condition showed
significantly less error than the control group, t = 2.52, df = 42.05, p = .016.
In attributing factors as having influenced their prediction, students again
showed good performance overall but significant group differences (Table 4).
The pair group most often attributed influence to the four effective factors and
least often to the ineffective factor. The control group were less likely than the
pair group to attribute influence to each effective factor and more likely to
attribute influence to the ineffective factor, with the classroom group intermediate but closer in performance to the pair group than to the control group.
Table 4. Mean Number of Times (of 6) Each Factor Was Reported as Having Influenced Prediction

                         Pair          Classroom     Control
                         condition     condition     condition
Effective factors
  Employment             *5.61         *5.00         *3.91
  Family size            *5.57         *5.25         *3.70
  Education              *5.87         *5.04         *4.04
  Climate                *5.30         *4.00          2.25
Ineffective factor
  Country Size            0.04          1.21          1.74

Note. *Mean of the contributing factor significantly different from the mean for the non-contributing factor, Country Size. N=23 in the pair condition, 24 in the classroom condition and 23 in the control condition.
Students were categorized based on the consistency of their attributions as follows:
1 - Chose only a single factor, but not consistently the same one, across the 6 countries
2 - Chose only a single, consistent causal factor across the 6 countries
3 - Chose multiple factors, but not consistently the same ones, across the 6 countries
4 - Chose multiple consistent factors (but not all four effective ones) across the 6 countries
5 - Chose the four effective factors completely consistently across the 6 countries
The mean difference between the pair and classroom groups on this scale was significant, t = 3.32, df = 42.38, p = .002. The classroom group significantly outperformed the control group, t = 2.86, df = 37.87, p = .007, as did the pair group, t = 7.73, df = 42.12, p < .001.
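As an illustration of how this coding scheme might be applied, the sketch below classifies one student's six attribution responses into the five levels described above. It is a hypothetical reconstruction of the scoring logic, not the instrument actually used; the factor names and the example data are placeholders.

```python
# Hypothetical sketch of the 1-5 attribution-consistency coding described above.
EFFECTIVE = {"Employment", "Family size", "Education", "Climate"}

def consistency_level(attributions):
    """attributions: a list of 6 sets, one per country, each containing the
    factors the student reported as influencing that prediction."""
    consistent = all(a == attributions[0] for a in attributions)
    multiple = any(len(a) > 1 for a in attributions)
    if not multiple:                 # single-factor choices throughout
        return 2 if consistent else 1
    if not consistent:               # multiple factors, but varying across countries
        return 3
    # consistent multi-factor pattern: level 5 only if it is exactly the four effective factors
    return 5 if attributions[0] == EFFECTIVE else 4

# Example: a student who always cites the four effective factors -> level 5
example = [set(EFFECTIVE) for _ in range(6)]
print(consistency_level(example))  # prints 5
```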


With respect to individual patterns of consistency in attributions, the same pattern of group differences appears. Only one student in the pair group
attributed influence to the non-effective factor one time. In the classroom
group, one student did so one time, while seven (29%) did so multiple times. In
the control group, six (26%) students did so one time while 11 (47%) students
did so multiple times.
In the pair group, 18 students (72%) consistently attributed influence to
all four effective factors and never to the ineffective factor. In the classroom
group, only nine students (38%) showed this pattern, while 13 (54%) showed
inconsistency in attribution across cases (i.e., a factor is in some cases claimed
to have influenced a prediction and in other cases not).2 In the control group
only two students (9%) showed the consistent pattern, with the remaining 21
(91%) showing inconsistency.
With regard to the number of factors to which influence was attributed, 83% (19) of the pair group, 54% (13) of the classroom group and 17% (4) of the control group most frequently correctly attributed influence to four factors. Only in the control group did some students (22%; 5 students) most frequently identify only a single factor as responsible for the outcome. (Remaining students most often chose 2 or 3 factors.) The pair and classroom groups performed significantly better than the control group in most frequently attributing influence to four factors, p < .001 and p = .015, respectively (Fisher's Exact test).
Argumentation
Formulating a research question. Students from the two intervention groups were roughly equally divided with respect to recognizing multiple potential causes, with 43% of the pair group (N=23) and 58% of the classroom group (N=24) identifying more than one potential cause. Among the control group (N=23), only 30% did so. The difference was not significant between the pair and control groups but approached significance between the classroom and control groups, p = .0798 (Fisher's Exact test).

Evidence and counterargument. Results for students in the three conditions appear in Table 5. Shown are percentages of respondents choosing
each option as providing the strongest counterargument to the claim that
tanning salon use was a causal factor with respect to cancer rates.
Table 5. Percentages of Students Making Different Response Choices Regarding Counterargument

                                                   Pair          Classroom     Control
                                                   condition     condition     condition
A. Air pollution is a more likely cause of
   cancer in the city                              4 (17%)       5 (21%)       6 (26%)
B. Many people outside the city also go to
   tanning salons and don't get cancer             6 (26%)       10 (42%)      9 (39%)
C. Many people who don't go to tanning salons
   also get cancer                                 13 (57%)      9 (38%)       8 (35%)
D. There are more tanning salons outside the
   city than in the city                           0 (0%)        0 (0%)        0 (0%)

Note. N=23 for the pair condition, 24 for the classroom condition and 23 for the control condition.
As seen in Table 5, most respondents chose B or C, with a smaller proportion favoring A. All students appeared to recognize that D was irrelevant to the claim, and none of them chose that option. Among the three relevant options, A, B, and C, chi-square goodness-of-fit tests show that students' choices were significantly different from chance in the pair and classroom conditions, χ²(2) = 26.42, p < .0001 and χ²(2) = 7.39, p = .025, respectively, but not in the control condition, χ²(2) = 2.66, p = .264.
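The logic of this comparison against chance can be sketched as follows: under the null hypothesis, a student choosing among the three relevant options at random would pick each with probability 1/3. The snippet below illustrates the procedure on hypothetical counts; it is not a re-analysis of the study's data.

```python
# Procedural sketch of a chi-square goodness-of-fit test against chance (1/3 each).
# The observed counts below are hypothetical, for illustration only.
from scipy import stats

observed = [4, 9, 12]              # hypothetical choices of options A, B, C
n = sum(observed)
expected = [n / 3] * 3             # chance expectation: equal choice of A, B, C

chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi2(2) = {chi2:.2f}, p = {p:.3f}")
```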
Reconciling claims. As noted earlier, this question, unlike the two previous ones, solicits not reasoning about the claims themselves but rather meta-level reasoning about their status in relation to one another and how the discrepancy between them is to be understood, a form of reasoning that is epistemological in nature and central to scientific practice.
We expected that answers to this question would give us insight into students' epistemological understanding regarding the nature of science, more specifically the extent to which they understood it as an enterprise involving competing claims whose status evolves as more evidence is brought to bear on them.
Contrary to our expectations, a very large majority of participants did not address the question as one of how the divergence in claims is best understood and reconciled. Their answers thus did not bear directly on their understandings regarding the nature of the scientific enterprise, except in the negative sense of their not seeing the divergence as warranting attention or interpretation. Instead, students' dominant stance was one of how to use toward practical ends the information that had become available. Their understanding of the nature and objectives of science, in other words, remained one common among this age group: producing good outcomes rather than analyzing causes and effects.
Thus, the objective students identified in the context of this question was to reduce cancer. A few students referred to a solution as simply continuing to search for a cure, either through research ("find more cures for cancer"), appeal to experts ("find someone with an answer"), or personal observation ("She should go to both places to try to find out herself"). The large majority, however, made reference to one or both of the causes suggested, but rather than addressing the discrepancy between them, the perceived objective continued to be reducing cancer, for example, "contact all the factories in Georgia to cut down on all the polluted air" or "Add more stores to buy more fruits and vegetables and use less fire because it releases smoke & causes air pollution."
None of these responses raised the question of whether the validity of the causal claims (the second of which made no reference to supporting evidence) should be evaluated, rather than the claims simply being put into action. Across the entire sample, there occurred only six exceptions, five from students in the classroom intervention condition and one in the control condition. One of these noted that the claims were not mutually exclusive ("I would advise that both could be possible"). The others sought to compare the likelihoods of their correctness, two drawing on their own knowledge ("Maybe air pollution is worse than not enough stores. They're both important but pollution in the air is important because what we are breathing is air," or "It can't be lack of fruit because some people don't eat fruit and don't get cancer"). The remaining three suggested methods for doing so, e.g., "Ask people that have cancer if they eat healthy. If they say yes they probably got cancer from air pollution."

Discussion
In light of a history of difficulty in effecting advances in higher-order thinking skills in the disadvantaged, low-performing population studied here, the goal of the present study was essentially met. Both intervention groups showed notable although far from perfect mastery of COV relative to a control group, yet with diminished maintenance in tests of far transfer following delay, among the pair group as well as the classroom group, for whom the intervention appeared less effective. With 80% of students in the pair condition and 50% in the classroom condition consistently showing controlled comparison across multiple tasks, these results compare favorably to previous efforts with similar populations that address only the COV strategy (Lorch et al., 2014; Siler et al., 2010).
Vacillation in use of new strategies, including COV, during a period when
they are first being acquired has been widely reported in microgenetic studies
(Kuhn, 1995; Siegler & Crowley, 1991). As commonly found, students in the
present study displayed more consistency in transfer tasks with familiar than
with unfamiliar material. An identical number of students in both intervention
groups (5 students in each, about 20%) did not maintain their previously
consistent use of their experimental design skills when they transitioned from
familiar (near test) to new (far test) content, becoming less consistent in its
application. The majority (60%) of the pair group maintained consistent control,
while slightly less than a third (29%) of the classroom group did so. The
difference in performance of the two groups (about 30 percentage points) thus
remained equivalent at the two assessments. These findings are consistent with
the view that continued and varied experience across a succession of domains
is necessary if consolidation of higher-level cognitive strategies (as assessed in
tests of maintenance and far transfer) is to be achieved.
With respect to the less studied skills entailed in coordination of effects of
multiple variables, as assessed in the Life Expectancy assessment, both
intervention groups displayed considerable mastery, especially in relation to the
far from optimal performance shown by adults on this task (Kuhn, Ramsey, &
Arvidsson, 2015) as well as the present control group.
The broad objectives of the present work were thus met, in establishing
that it is feasible using the methods we did to engage low-achieving,
underserved middle-school students in activities that introduce them to
practices of science, following which they show measurable benefits, with a majority continuing to evidence understanding of and facility in these practices several weeks later. In other recent work (Ramsey, 2014), there exists evidence of the benefits of experience in similar activities extending to a more traditional physical science topic (weather patterns).
Results are more mixed with regard to argumentation and to developing
meta-level thinking about argument and related understandings of nature of
science. The intervention activities involved not only a repeated requirement to
justify ones claims with appropriate evidence, but also repeated engagement
with the probe, "Suppose someone disagreed with you...," which we expected to afford exercise in defending and supporting alternative claims using evidence-based arguments. This expectation was largely met within the intervention. Students for the most part became successful with practice in meeting these demands and most often did so confidently by means of direct reference to evidence, e.g., "I'd show them the results on the chart."
Where students demonstrated less success was in extending these
competencies to meta-level reflection on argumentation in new contexts
unrelated to the intervention. We expected the intervention to support students' understanding of evidence as the strongest arbiter of divergent claims and to
thus lead them to reason about instances of such divergence accordingly. These
expectations were largely not realized when the assessment was extended to
new contexts not connected to the intervention. The post-intervention
argumentation assessments did suggest some achievement with respect to
recognizing multiple contributors to an outcome, consistent with the findings of
Kuhn, Ramsey and Arvidsson (2015) in a multi-year intervention. In the
counterargument assessment, however, only a slight difference across groups
appeared, with control group students more often mistaking the existence of an
alternative contributor (option A) as evidence against the role of the contributor
being examined. No group differences appeared, however, in students'
recognizing absence of the outcome in the presence of the antecedent (option
B) as the strongest evidence to use in counterargument (the option adults are
most likely to choose), and they were just as likely to choose the absence of the
antecedent in the presence of the outcome (option C) as strongest (despite its
consistency with an alternative factor having produced the outcome).
Finally, students showed least progress in the group-administered
assessment asking them to account for divergent claims in this context
unconnected to the intervention. Only a small proportion of students undertook
to do so. The large majority did not treat divergent causal claims as a cause for
attention, examination, or attempted reconciliation. Instead, without
acknowledging the divergence, they focused on one or the other or both
imputed factors as causes worthy of action without further investigation.

In hindsight, we perhaps should not have expected to see significant gains in meta-level understanding of argumentation and in epistemological
understanding related to nature of science in an intervention of no more than
10 sessions. This is even more the case given the population we worked with,
whose academic experiences and values are at best limited and even negative.
Other authors have noted the long-standing difficulty of fostering progress in
understandings of nature of science (Carey & Smith, 1993), and reports of
advances in this respect following modest interventions are rare. It thus appears
likely that more than the amount of engagement we provided is required if
novice students are to recognize a divergence of claims as something needing
to be examined and evaluated as an undertaking central to the practice of
science. There exists evidence of this progress being more readily achieved in a
social than physical science domain (Kuhn, 2010), but experience across a
broad range of contexts is more likely to be effective than experience in only
one.
The picture is more positive with respect to developing skill in argumentation, as contrasted with meta-level understanding regarding argument.
However, again, it does not appear to happen quickly, as the result of brief
interventions. In a dialogic approach to developing skills of argumentation (Kuhn
& Crowell, 2011; Kuhn, Hemberger, & Khait, 2015), middle-school students did
show progress in producing counterargumentation (Crowell & Kuhn, 2014) and
in evaluating it (Kuhn, Zillmer, Crowell, & Zavala, 2013), but these advances
generally were not evident until the second year of a twice-weekly intervention
that engaged students in goal-based activity.
Others have emphasized how critical it is that students develop their
understandings of the nature of science by engaging in it (Sandoval, 2005), and
that students understand the purpose of the activities they engage in if they are
to gain anything from their engagement (Berland & Hammer, 2012). The nature
of science, most science educators now agree, cannot be taught in a deep way
through direct instruction and rather must be experienced in a context of
meaningful activity. As students engage in a larger number of purposeful, goal-directed activities that involve science practices, over a wide range of content,
they are in the best position to extract some general attributes of these
practices and to appreciate their value. This experience can only accrue
gradually. As well as the skill components involved in coordinating evidence and
claims in the service of argument, scientific practice encompasses the values
and norms that come from participation with others in a community that
upholds shared standards of knowing (Manz, 2014; Sandoval, 2015).
We turn finally to the comparison of our two intervention groups, whose performance showed notable differences. A major difference in implementation of the intervention was not only the whole-class vs. pair setting but also the time frame: the classroom group participated for ten 45-min class sessions, whereas the pair group participated on average for the equivalent of only three such class sessions (an average of six 24-min sessions), less than a third as much time invested. Moreover, the classroom group showed a gain in COV from their interim performance on a whole-class assessment to their performance assessed individually several weeks after the intervention, suggesting the importance of this difference in assessment conditions alone.
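For concreteness, the time investment compares as follows (a simple check using the session counts just reported):

$6 \times 24\ \text{min} = 144\ \text{min}$ versus $10 \times 45\ \text{min} = 450\ \text{min}$; $144/450 \approx 0.32$, just under one third.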
Overall, it was the individually instructed group that fairly consistently showed superior performance, despite the lesser time invested. This outcome, we believe, has important practical implications, in speaking to the value of individualized instruction, in particular for the population we studied. The classroom group was able to make progress, but less efficiently, in large part, we believe, due to (a) students' individual claims not being directly challenged (instead only less directly so during whole-class discussion); (b) the attitude we observed in the classroom that only intermittent attention was necessary to keep pace with what was taking place; and (c) the classroom time that routinely needed to be diverted to gaining and regaining students' attention. In the pair condition, in contrast, the two students were in constant interaction with one another and with the adult, and were required to think about and justify whatever they said at the time they said it. In current work, we are therefore exploring ways to automate the protocol shown in the Appendix in a way that could make it practical for large-scale use. This is of course a sizeable step from personal conversation between peers and a more knowledgeable facilitator, but we believe it may be one worth pursuing, especially in seeking to reverse the long-standing lack of success that continues to be widely observed among the population studied here.


Footnotes
1. In a sample of 16 educated adults, 8 (50%) chose B, while 3 (19%) chose
both B and an additional option.
2. The remaining two students in each intervention group were consistent but
attributed influence to only 3 of the 4 factors.


References
Applebee, A. N. (1996). Curriculum as conversation: Transforming traditions of
teaching and learning. University of Chicago Press.
Berland, L. K., & Hammer, D. (2012). Students' framings and their participation
in scientific argumentation. In xxxxxxx (Ed.), Perspectives on scientific
argumentation (pp. 73-93). Springer Netherlands.
Carey, S., & Smith, C. (1993). On understanding the nature of scientific
knowledge. Educational Psychologist, 28(3), 235-251.
Crowell, A., & Kuhn, D. (2014). Developing dialogic argumentation skills: a 3-year intervention study. Journal of Cognition and Development, 15(2), 363-381.
Ford, M. J. (2012). A dialogic account of sense-making in scientific
argumentation and reasoning. Cognition and Instruction, 30(3), 207-245.
Forman, E. A., & Ford, M. J. (2014). Authority and accountability in light of
disciplinary practices in science. International Journal of Educational
Research, 64, 199-210.
Jewett, E., & Kuhn, D. (2015). Problem-based Learning as a Tool in Developing
Higher-order Intellectual Skills in Low-achieving Students. Manuscript
under review.
Kelly, G. (2008). Inquiry, activity and epistemic practice. In R. Duschl, & R.
Grandy (Eds.), Teaching scientific inquiry: Recommendations for research
and implementation (pp. 99-117). Rotterdam, The Netherlands: Sense
Publishers.

Kuhn, D. (1995). Microgenetic study of change: What has it told us? Psychological Science, 6, 133-139.
Kuhn, D. (2010). Teaching and learning science as argument. Science Education,
94, 810-824.
Kuhn, D. (2011). What is scientific thinking and how does it develop? In U.
Goswami (Ed.), Handbook of childhood cognitive development. Oxford:
Blackwell. (2nd ed.)
Kuhn, D., & Crowell, A. (2011). Dialogic argumentation as a vehicle for
developing young adolescents' thinking. Psychological Science, 22(4),
545-552.
Kuhn, D., & Dean, D. (2005). Is developing scientific thinking all about learning to
control variables? Psychological Science, 16, 866-870.
Kuhn, D., & Dean, D. (2008). Scaffolded development of inquiry skills in
academically-disadvantaged middle-school students. Journal of the
Psychology of Science and Technology, 1, 36-50.
Kuhn, D., Hemberger, L., & Khait, V. (2015). Argue with me: Argument as a path
to developing students' thinking and writing. New York: Routledge. (2nd ed.)
Kuhn, D., & Pease, M. (2008). What needs to develop in the development of inquiry skills? Cognition and Instruction, 26(4), 512-559.


Kuhn, D., & Pease, M. (2009). The dual components of developing strategy use.
In H.S. Waters & W. Schneider (Eds.), Metacognition, strategy use, and
instruction. New York: Guilford Press.
Kuhn, D., Ramsey, S., & Arvidsson, T. S. (2015). Developing multivariable
thinkers. Cognitive Development, 35, 92-110.
Kuhn, D., Zillmer, N., Crowell, A., & Zavala, J. (2013). Developing norms of
argumentation: metacognitive, epistemological, and social dimensions of
developing argumentive competence. Cognition and Instruction, 31(4),
456-496.
Lazonder, A., & Kamp, E. (2012). Bit by bit or all at once? Splitting up the inquiry
task to promote children's scientific reasoning. Learning and Instruction,
22, 458-464.
Lehrer, R., & Schauble, L. (2006). Scientific thinking and scientific literacy:
Supporting development in learning contexts. In K. A. Renninger & I. Sigel
(Vol. Eds.) & W. Damon (Series Ed.), Handbook of Child Psychology. Vol. 4.
Hoboken, NJ: Wiley. (6th ed.)
Lehrer, R., & Schauble, L. (2015). The development of scientific thinking. In L.
Liben (Vol. Ed.) & R. Lerner (Series Ed.), Handbook of Child Psychology
and Developmental Science. Vol. 2. Hoboken, NJ: Wiley. (7th ed.)
Lorch Jr, R. F., Lorch, E. P., Calderhead, W. J., Dunlap, E. E., Hodell, E. C., & Freer,
B. D. (2010). Learning the control of variables strategy in higher and lower
achieving classrooms: Contributions of explicit instruction and
experimentation. Journal of Educational Psychology, 102(1), 90-101.


Lorch Jr, R. F., Lorch, E. P., Freer, B. D., Dunlap, E. E., Hodell, E. C., & Calderhead,
W. J. (2014). Using valid and invalid experimental designs to teach the
control of variables strategy in higher and lower achieving classrooms.
Journal of Educational Psychology, 106(1), 18-35.
Manz, E. (2014). Representing Student Argumentation as Functionally Emergent
From Scientific Activity. Review of Educational Research, xxxxxxxxxxxxxx
Osborne, J. (2014). Teaching scientific practices: Meeting the challenge of
change. Journal of Science Teacher Education, 25(2), 177-196.
Ramsey, S. H. (2014). How Do We Develop Multivariable Thinkers? An
Evaluation of a Middle School Scientific Reasoning
Curriculum. Unpublished Ph.D. dissertation, Teachers College, Columbia
University.
Rudolph, J. L. (2014). Dewey's "Science as Method" a Century Later: Reviving Science Education for Civic Ends. American Educational Research Journal,
51, 1056-1083.
Sandoval, W. (2005). Understanding students' practical epistemologies and their influence. Science Education, 89, 634-656.
Sandoval, W. (2014). Science education's need for a theory of epistemological
development. Science Education, 98, 383-387.
Sandoval, W. (2015). Epistemic goals. In R. Gunstone (Ed.), Encyclopedia of
Science Education (pp. 393-398). Dordrecht: Springer Netherlands.


Schauble, L., Klopfer, L. E., & Raghavan, K. (1991). Students' transition from an
engineering model to a science model of experimentation. Journal of
Research in Science Teaching, 28(9), 859-882.
Siegler, R., & Crowley, K. (1991). The microgenetic method: A direct means for
studying cognitive development. American Psychologist, 46(6), 606-620.
Siler, S., Klahr, D., Magaro, C., Willows, K., & Mowery, D. (2010, January).
Predictors of transfer of experimental design skills in elementary and
middle school children. In Intelligent Tutoring Systems (pp. 198-208).
Springer Berlin Heidelberg.
Zimmerman, C. (2007). The development of scientific thinking skills in
elementary and middle school. Developmental Review, 27, 172-223.


Table 1. Comparison of Intervention Groups on Maintenance and Near Transfer of Design and Inference Skills

                                                          Pair condition    Classroom condition
Never varied focal variable                               1 (4%)            3 (13%)
Varied focal variable only sometimes                      2 (8%)            3 (12%)
Consistently varied focal variable but inconsistent
  control of other variables                              2 (8%)            6 (25%)
Consistent controlled comparison                          20 (80%)          12 (50%)

Note. N=25 for the pair condition and 24 for the classroom condition.


Table 2. Comparison of Groups on Far Transfer of Design and Inference Skills

                                                          Pair           Classroom      Control
                                                          condition      condition      condition
Did not consistently vary focal variable                  3 (12%)        13 (54%)       14 (61%)
Consistently varied focal variable but inconsistent
  control of other variables                              7 (28%)        4 (17%)        7 (30%)
Consistent controlled comparison                          15 (60%)       7 (29%)        2 (9%)

Note. N=25 for the pair condition, 24 for the classroom condition, and 23 for the control condition.


Table 3. Mean Prediction Error Scores and Modal Frequencies by Group

                                                          Pair           Classroom      Control
                                                          condition      condition      condition
Prediction error mean (SD)                                0.16 (.23)     0.42 (.39)     0.67 (.28)
Students showing modal performance of zero error          19 (83%)       15 (63%)       5 (22%)

Note. Standard deviations in parentheses. Maximum prediction error = 2. N is reduced to 23 in the pair condition due to an administrative error for two participants; N=24 in the classroom condition and N=23 in the control condition.


Table 4. Mean Number of Times (of 6) Each Factor Was Reported as Having Influenced Prediction

                              Pair           Classroom      Control
                              condition      condition      condition
Effective factors
  Employment                  5.61*          5.00*          3.91*
  Family size                 5.57*          5.25*          3.70*
  Education                   5.87*          5.04*          4.04*
  Climate                     5.30*          4.00*          2.25
Ineffective factor
  Country size                0.04           1.21           1.74

Note. *Mean for the contributing factor significantly different from the mean for the non-contributing factor, country size. N=23 in the pair condition, 24 in the classroom condition, and 23 in the control condition.


Table 5. Percentages of Students Making Different Response Choices Regarding Counterargument

                                                          Pair           Classroom      Control
                                                          condition      condition      condition
A. Air pollution is a more likely cause of
   cancer in the city                                     4 (17%)        5 (21%)        6 (26%)
B. Many people outside the city also go to
   tanning salons and don't get cancer                    6 (26%)        10 (42%)       9 (39%)
C. Many people who don't go to tanning salons
   also get cancer                                        13 (57%)       9 (38%)        8 (35%)
D. There are more tanning salons outside the
   city than in the city                                  0 (0%)         0 (0%)         0 (0%)

Note. N=23 for the pair condition, 24 for the classroom condition, and 23 for the control condition.

Figure 1. InspireData Chart Showing All Cases

Figure 2. InspireData Chart Showing Only Cases with Average Level of Fitness

Figure 3. InspireData Chart for the Fitness Factor

Appendix: Intervention Protocol


Phase 1. Design and Interpretation of Experiments
Here are the applicants' records on cards. (Facilitator displays cards and explains how to read a record. Each card shows background information about this applicant, with a blank space for the applicant's performance rating in the simulator to be recorded.) Studying the records carefully, you will be able to see which factors make a difference to performance and which don't.
You and your partner will share the records, and one rule is that everything you do, you have to agree to first. (Facilitator is to ensure that pairs are discussing their actions and decisions with each other and that each student is paying attention to what the partner says.)
You'll do best if you investigate one factor at a time. Which factor do you want to start with? You can start with one you think is going to make a lot of difference, but eventually you'll investigate them all.
(Pair discusses and responds.)
(Facilitator displays phase 1 factors list on a stand and puts an arrow post-it on
the factor the pair has selected for investigation. As needed, she points to the stand
to remind pairs which factor they are investigating.)
(For the first factor only) Great, now choose one or two records you want to look
at first. If you want to find a particular case, let me know and I can help you find
it. Remember to discuss and agree before you make your choices. Once you
have decided on the records to look at, I will go to the database to find the
performance outcome for that applicant for you. You can record it on that record
card. (Pair is shown where to write down the outcome on the record card.)
(Pair discusses and agrees on choice of record card(s) from the set.)

A. (If pair chooses only one record) What are you going to find out from this
record? (Pair responds.)
(Facilitator provides the outcome for the chosen record. Pair records the
outcome on the record.)
What did you find out? (Pair responds.)
Can you conclude whether [factor being investigated] makes a difference to the
performance in the simulator? (Pair responds.)
1. (If pair says yes, they can tell whether the factor makes a difference)
What will happen to the outcome for this applicant if [factor being investigated]
goes up? (Pair responds based on belief.)
Do you know for sure? Why don't you test this out to be sure? What cards would
you need to test this out?
2. (If pair says no) What cards would you need in order to be able to find out
whether this factor makes a difference? (Pair responds.)
(If pair does not suggest finding another record to compare) What will happen
to the outcome for this applicant if [factor being investigated] goes up? (Pair
responds based on belief.)
Do you know for sure? Why don't you test this out to be sure? What cards would
you need to test this out?
B. (If pair chooses two records) What are you going to find out by comparing
these two records? (Expected response: whether X makes a difference to the
outcome.)
(If pair answers anything else, guide the pair to think only about finding out whether [factor being investigated] makes a difference to the outcome.)
(Facilitator provides the outcomes for the chosen records. Pair records the outcomes on the records.)

(Pair examines the two records side-by-side, with outcomes shown.)


What did you find out? (Pair responds.)
(Facilitator to repeat the pair's claim so that all parties are clear about the claim the
pair is making.)
1. (For causal claim, if not controlled) That may be true. But can we really
tell for sure that [factor being investigated] makes a difference to the
performance? (Point to another factor on one of the cards) Couldn't someone
say that it is because applicant A has [certain level] for [another factor] and that
is why applicant A has a better grade than applicant B? Do you really know for
sure that [factor being investigated] made the difference?
a. (If pair says, no, you cannot be sure) How can we be 100% sure? Is
there a better record to compare applicant A's record to? (Pair chooses a new
record to compare with A, and is provided with outcome information.)
b. (If pair says, yes, you can be sure, ask the Fallback Questions at the
end of this section.)
2. (For non-causal claim, if not controlled) That may be true. But [factor being investigated] and [another factor] both differ across the two cards, yet the outcome is the same. Can you really know why their outcomes are the same?
Maybe one makes the outcome go up and the other makes it go down so they
offset each other? Can you really be sure that [factor being investigated] does
not make a difference? (Pair responds.)
a. (If pair says, no, you cannot be sure) Is there a better record to
compare it to so that we can be 100% sure? What cards would you need to see
instead so that you can be sure?
b. (If pair says, yes, you can be sure, ask the Fallback Questions at the
end of this section.)


3. (If records were controlled) What will the outcome be when [factor being investigated] changes from one level to another? (Pair responds. If necessary,
facilitator reviews what it means that a factor makes a difference.)
Make an argument to the foundation for how you're sure [factor being investigated] is/isn't important. (Pair responds.)
But couldn't I say that applicant A has [a certain level] for [another factor] and
that is why applicant A has a better grade than applicant B? (Expected
response: no, because applicant B also has the same level for [the other factor]
as applicant A. So it cannot be [the other factor] that is making the difference in
performance. If a student does not provide the expected answer, guide the
partner to challenge the student's response. If both students fail to note that
[the other factor] was the same for both records, facilitator asks this question
again when investigating the next factor and points specifically to the records.)
Suppose someone disagrees with you and doesn't think that [factor being
investigated] does/does not make a difference; what would you say to them to
convince them? (Pair responds. Eventually, expected response: show them the
cards.)
Let's write down what you found out here as a memo to the foundation. (Give pair a memo form.)
(Once pair has successfully controlled for one or two factors) What if we change
your comparison so that [another factor] also differs? Can you still use this
comparison to show that [factor being investigated] makes a difference to the
outcome? (Pair responds.)
Why is this comparison not convincing? (Pair responds.)
Now choose another factor to investigate.
4. (Fallback Questions if, after a few tries, pairs still do not choose controlled records) What do you think would happen to applicant A's performance if, for [factor being investigated], she had a different level? (Pair makes a guess of performance.)
Should we find out what the records show? Which record do we need? (Pair
responds.)
a. (If pair still does not choose a controlled record) What would applicant A's record look like if she had a different level for [factor being investigated]? (Pair responds.)
Let's look for that card so that we know what happens to applicant A's performance if she has a different level for [factor being investigated].
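(The contrast this phase is designed to bring out, between a controlled and an uncontrolled comparison of two record cards, can be stated compactly. The following is a minimal sketch, in Python, of the check an automated version of the protocol, of the kind mentioned in the Discussion, would need to make; the dictionary representation of a record and the factor names and levels are illustrative assumptions, not the study's materials.)

# Illustrative sketch: is a two-card comparison controlled with respect to a
# focal factor? Records are assumed to be dicts mapping factor names to levels;
# the factor names and levels below are hypothetical.

def is_controlled_comparison(record_a, record_b, focal_factor):
    """True if the two records differ on the focal factor and match on all other factors."""
    if record_a[focal_factor] == record_b[focal_factor]:
        return False  # the focal factor must vary, or nothing can be learned about it
    other_factors = [f for f in record_a if f != focal_factor]
    return all(record_a[f] == record_b[f] for f in other_factors)

a = {"fitness": "excellent", "education": "college", "experience": "some"}
b = {"fitness": "average", "education": "none", "experience": "some"}
c = {"fitness": "average", "education": "college", "experience": "some"}

print(is_controlled_comparison(a, b, "fitness"))  # False: education is confounded
print(is_controlled_comparison(a, c, "fitness"))  # True: only fitness differs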

Phase 2. Multivariable Coordination


So far our conclusions are all based on just two cases. We would be more sure
of our conclusions if we looked at more cases. We have a way to do that.
(Facilitator introduces the InspireData chart. She discusses how to read a chart, showing how a record is represented by a single diamond, displaying the details of one diamond, and highlighting that the information is the same as on one record card from Phase 1.)
Here, we collected a lot of records. Look just at the average level for fitness.
Why is there a range of different outcomes, even though everyone is average
on fitness? (Expected response: Because other factors matter as well.
Facilitator shows the details of some records with different performance levels
and points out that, except for fitness, the records have different levels for
other factors.)
So, all these other factors may also make the outcome change. That's why we collected lots of cases, and we have the same number of records for all the different levels. For example, we have the same number of records of people who have no college, some college, and a college education. Let's see if you come to the same conclusion as you did earlier, now that you are looking at more results.
Which factor do you want to look at first using the charts? (Pair responds.)
(Shows chart for [factor being investigated].)
So, remember, other things may be contributing as well. But can we say
OVERALL that [factor being investigated] makes a difference?
A. (If pair refers to beliefs or Phase 1 comparisons) Remember that the organization wants to be really sure before we make any decisions. So we don't want to rely on opinions or on just two cases.
What does the chart say about whether the factor matters or not to the
performance?
B. (If pair is unsure how to tell if [factor being investigated] makes a
difference) Let's look at the chart together. What do you notice about the
performance levels when the applicants have [a certain level, e.g. average] and
what about when they have [a different level, e.g., excellent] for [factor being
investigated]? (Pair responds.)
Do the applicants have different or similar kinds of performance? What does
that tell you about whether [factor being investigated] makes a difference to
the performance?
C. (If pair correctly compares the distributions) Make an argument to the foundation for how you are sure [factor being investigated] is/isn't important. (Correct response: by comparing the two distributions.)
Suppose someone disagrees with you; what would you say to them to convince
them? (Pair responds. Expected response: show them the chart.)


How would the graph look different if [factor being investigated] makes no
difference to how well people do?
Let's write down what you found out here as a memo to the foundation. (Give pair a memo form.)
Now choose another factor to investigate.
(Once all factors have been examined) Now, let's make a summary of what you have figured out. (Facilitator assists as needed as pair completes the summary memo.)
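(The reasoning this phase supports, judging whether a factor matters by comparing the full distribution of outcomes at each of its levels rather than a single pair of cases, can likewise be sketched briefly. A minimal Python illustration follows; the data representation, factor levels, and outcome values are invented for illustration and do not reproduce the InspireData software or the study's records.)

# Illustrative sketch: group many records' outcomes by the level of one factor,
# the comparison the charts support visually.
from collections import defaultdict
from statistics import mean

def outcomes_by_level(records, factor):
    """Map each level of the factor to the list of outcomes observed at that level."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record["outcome"])
    return dict(groups)

# Hypothetical records; levels and outcomes are made up for illustration only.
records = [
    {"education": "none", "outcome": 2}, {"education": "none", "outcome": 3},
    {"education": "some college", "outcome": 3}, {"education": "some college", "outcome": 4},
    {"education": "college", "outcome": 4}, {"education": "college", "outcome": 5},
]

for level, outcomes in outcomes_by_level(records, "education").items():
    # If the outcome distributions differ across levels, the factor appears to matter.
    print(level, outcomes, round(mean(outcomes), 2))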

Phase 3. Prediction
You've worked hard to figure out what factors mattered to the applicants' performance. Now, you can use your knowledge to predict how well they will do.
I have some new applicants here. Can you predict how well they will do? Then
you can choose the best team of five.
What information about the applicant would you need to predict his/her
performance? I can only tell you up to four things about the applicant. Which
factors do you want to know about? Look at the list of factors here; what
information about them would you like to have? You can make use of the
summary sheet or refer to any of the charts anytime. (Expected response: all
causal factors. If not all causal factors are requested, urge pairs to review each
chart and, if necessary, repeat Phase 2 protocols.)
Here is the information we have on this applicant. (Facilitator provides data for all requested causal factors and, unless the pair chose a non-causal factor, also includes Home Climate, a non-causal factor, for the applicant.)


Now you can predict. Predict how each one will do, and explain why you made
that prediction. Be sure to discuss with your partner before making a final
decision.
(Applicant description sheets are presented, one at a time, with charts handy so
that pair can refer to them when needed.)
(For the first three predictions, facilitator asks, Which factors mattered to your prediction?)
A. (If pair selects a non-causal factor as influencing the prediction)
What did you find out about [the non-effective factor]? Did it make a difference
to the outcome?
1. (If pair answers yes) How did you know? (Pair responds based on belief.)
Do you know for sure? Let's look back at the chart.
2. (If pair answers no) When you predict how well someone will do, will it help
you to know whether they have [a certain level] or [another level] on this
factor?
a. (If pair answers yes) Do you know for sure? Why don't you check out the
chart?
(Show chart of the non-causal factor)
What do you think will happen to the applicant's performance if, for [the non-effective factor], they go from having [a certain level] to [another level]? (Pair responds.)
i. (If pair answers no change) Do you still need to know about this factor to
make your prediction?
ii. (Otherwise) Let's find out. (Points to chart) Does it matter to the
performance whether [the non-effective factor] is [a certain level] or [another
level]? (Pair responds.)

Do you still want to know about this factor to make your prediction?
B. (If pair does not select one or more effective factors as influencing
the prediction) What made you predict this performance level? (Pair response: because [a causal factor] is at [a certain level].)
So you are saying that because [a causal factor] is at [a certain level], that's why you think this will be the performance level? What about [a causal factor
not selected]? Did you find out whether [the causal factor not selected] makes a
difference? (Pair responds.)
1. (If pair says yes) Does the applicant's performance go up when [the causal factor not selected] changes from one level to another? (Pair responds.)
a. (If pair says yes) Then, when someone has a high level of [the causal factor selected] and [the causal factor not selected], what happens to the applicant's performance? (Expected response: the performance goes up even more.)
(If needed) Will the high level on [the causal factor not selected] make it go up even more than if it were just [the causal factor selected] affecting it?
b. (If pair says no) Let's look at the chart for [the causal factor not selected].
(Once all predictions have been made)
Now, choose from all these applicants. Which five should be chosen for the final
team? Discuss until everyone agrees.
Now, let's look at the ones you have chosen.
(Review each of the chosen applicants and compare their predicted
performances. For the ones with different performances, ask pairs to review
their predictions side by side.)
You gave these applicants a lower grade; can you explain why?
(Repeat until all of the predictions of the chosen five applicants have the
appropriate
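(The prediction step of Phase 3 can also be rendered computationally, which is relevant to the automated version of the protocol mentioned in the Discussion. The sketch below assumes a simple additive rule over causal factors; the factor names, level codings, and weights are illustrative assumptions and are not the study's actual materials or scoring.)

# Illustrative sketch of Phase 3: predict an applicant's performance from the
# causal factors only, omitting any non-causal factor (such as home climate).
LEVELS = {"poor": 0, "average": 1, "excellent": 2}        # hypothetical ordinal coding
CAUSAL_FACTORS = ["fitness", "education", "experience"]   # illustrative set of causal factors

def predict_performance(applicant):
    """Sum the coded levels of the causal factors; higher totals mean better predicted performance."""
    return sum(LEVELS[applicant[factor]] for factor in CAUSAL_FACTORS)

applicants = {
    "A": {"fitness": "excellent", "education": "average", "experience": "poor",
          "home climate": "excellent"},  # the non-causal factor is ignored by the model
    "B": {"fitness": "average", "education": "excellent", "experience": "excellent",
          "home climate": "poor"},
}

# Rank applicants by predicted performance, as when choosing the final team.
ranking = sorted(applicants, key=lambda name: predict_performance(applicants[name]), reverse=True)
for name in ranking:
    print(name, predict_performance(applicants[name]))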
