Middle School Students’ Approaches to Reasoning about Disconfirming Evidence

This study investigated differences in how middle school children reason about disconfirming evidence. Scientists evaluate hypotheses against evidence, rejecting those that are disconfirmed. Although this instant rationality propels empirical science, it works less well for theoretical science, where it is often necessary to delay rationality, that is, to tolerate disconfirming evidence in the short run. We used behavioral measures to identify two groups of middle-school children: strict reasoners, who prefer instant rationality and quickly dismiss disconfirmed hypotheses, and permissive reasoners, who prefer delayed rationality and retain disconfirmed hypotheses for further evaluation. We measured their scientific reasoning performance as well as their cognitive ability and motivational orientation. What distinguished the groups was not overall differences in these variables but how these variables predicted scientific reasoning. For strict reasoners, better scientific reasoning was associated with faster processing, whereas for permissive reasoners, better scientific reasoning was associated with more deliberate thinking: slower processing and broader consideration of both disconfirmed and alternate hypotheses. These findings expand our understanding of “normative” scientific reasoning.


Introduction
This study investigated differences in how middle school children reason about disconfirming evidence. We hypothesized that some children are naturally inclined towards fallibilism (Popper, 1963). These strict reasoners prefer instant rationality and quickly dismiss hypotheses that have been disconfirmed. We hypothesized that other children are permissive reasoners: they prefer to delay rationality and are willing to retain hypotheses in the face of disconfirming evidence for further evaluation.
positively associated with intelligence (Miyake & Friedman, 2012; Miyake, Friedman, Emerson, Witzki, Howerter, & Wager, 2000) and that improves over development (Huizinga, Dolan, & van der Molen, 2006). We predict that strict and permissive reasoners will be comparable in these basic and complex cognitive abilities. This prediction is surprising from the fallibilist position that instant rationality is normative, a position that expects strict reasoners to be "better" than permissive reasoners.¹ With respect to motivation, we focused on Dweck's (2000; Dweck & Leggett, 1988) proposal that people hold implicit theories of intelligence that govern how they respond to failure. Entity theorists believe that intelligence is fixed and that failures cannot be overcome. By contrast, incremental theorists believe that intelligence is malleable and that failures serve as feedback in the learning process, and can therefore be overcome. We predict that incremental theorists will be better scientific reasoners than entity theorists, and that this should be true for both strict and permissive reasoners. This follows from our proposal that the groups do not differ on whether they adjust their beliefs based on disconfirming evidence. Rather, they differ on the time course of adjustment: instantly or after a delay.
Although we predict no differences between strict and permissive reasoners in overall cognition, motivation, and scientific reasoning, we do predict that group membership will modulate the relation between cognitive ability and motivational orientation, on the one hand, and scientific reasoning, on the other. Strict reasoners prefer instant rationality, so we predict that faster processors will be better scientific reasoners, because what matters most is the fluency with which hypotheses are evaluated against evidence and potentially disconfirmed. By contrast, permissive reasoners prefer to delay rationality, so we predict that better scientific reasoning will be associated with less impulsive, more deliberate thinking, as indexed by slower processing speed and by systematic consideration of (and shifting between) alternate hypotheses in the WCST.
We focus on children rather than adults for several reasons. The newly revised national science standards emphasize the need for K-12 students to engage in authentic science experimentation to develop scientific practices such as designing and carrying out valid investigations, analyzing and interpreting data, constructing explanations, and drawing evidence-based conclusions (NRC, 2012). Moreover, in preschool children there is a correlation between hindsight bias, which we interpret positively as a form of permissive reasoning, and executive function, a basic cognitive ability (Bernstein, Atance, Meltzoff, & Loftus, 2007). We focused further on middle-school students because prior research has documented relationships among the measures relevant to our research questions in this age group. Scientific reasoning is correlated with the WCST, a measure of complex cognitive ability, in middle- and high-school students (Kwon & Lawson, 2000). In addition, prior research has found a relation in middle-school students between achievement in mathematics, a subject closely related to science, and motivational orientation, that is, holding an incremental versus an entity implicit theory of intelligence (Blackwell, Trzesniewski, & Dweck, 2007).

Participants
Participants were 109 seventh-grade students (age: M = 13.37 years, SD = 0.29; 55 males, 54 females) from a racially and ethnically diverse middle school in a Midwestern suburb.

Design
Each participant completed all measures. Participants were classified as either strict or permissive reasoners based on their performance on two measures of response to disconfirming evidence, as described below.

Measures
We collected measures of response to disconfirming evidence, simple and complex cognitive ability, motivational orientation, and scientific reasoning. All measures were designed or adapted to be administered in a group setting.

Prediction Evaluation Task
We designed two measures of permissive reasoning (i.e., hindsight bias) in scientific reasoning. The first adapted a measure from the social psychology and neuroscience literature (Batson, Thompson, Seuferling, Whitney, & Strongman, 1999; Greene & Paxton, 2009). On each trial, participants first predicted the outcome of a coin flip. The experimenter then flipped the coin and announced the outcome. Finally, participants evaluated their prediction by circling "correct" or "incorrect" on a response sheet. There were 20 trials. Participants could not see the actual outcome of each flip, which was performed at the front of the classroom; instead, they relied on the experimenter's announcement. In fact, the announced outcomes were predetermined so that equal numbers of heads and tails occurred over trials 1-10 and over trials 11-20. Participants who evaluated more than 15 of their predictions as correct (< 1% probability by chance) were classified as permissive reasoners.
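The < 1% criterion for the prediction evaluation task can be checked with a binomial tail probability. The minimal sketch below assumes that an honest participant's prediction matches the announced outcome with probability 0.5 on each of the 20 trials, independently of the other trials. Because the announced outcomes were in fact balanced in advance, this is an approximation and not necessarily the calculation behind the reported figure. The function name `binomial_upper_tail` is our own.

```python
from math import comb

def binomial_upper_tail(k_min: int, n: int, p: float) -> float:
    """Probability that X >= k_min when X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

N_TRIALS = 20   # coin-flip trials in the prediction evaluation task
THRESHOLD = 16  # "more than 15" predictions self-evaluated as correct

# Assumption: an honest guess matches the announced outcome with
# probability 0.5 on each trial, independently of the other trials.
p_chance = binomial_upper_tail(THRESHOLD, N_TRIALS, 0.5)
p_one_lower = binomial_upper_tail(THRESHOLD - 1, N_TRIALS, 0.5)
print(f"P(X >= {THRESHOLD}) = {p_chance:.4f}")         # about 0.0059
print(f"P(X >= {THRESHOLD - 1}) = {p_one_lower:.4f}")  # about 0.0207
```

Under this approximation, 16 is the smallest count whose chance probability falls below 1% (about 0.6%). A threshold of 15 would allow roughly a 2% chance of misclassifying an honest participant.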

Science Quiz
We designed a second measure of permissive reasoning in scientific reasoning.² Participants completed a science quiz of 10 multiple-choice questions; see Appendix A for example items. They were told that these were college-level questions and that they were unlikely to know any of the answers ("even your teacher would have difficulty with this quiz"), but that they should try their best. They were also told that the experimenter had made a mistake and had printed the answer-key version of the quiz rather than the blank version. Thus, the "answers" appeared at the bottom of the quiz, printed upside down and backward. Participants were told not to look at them. These "answers" were actually incorrect, and participants who gave six or more of them (< 1% probability by chance) were classified as permissive reasoners.
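The chance probability for the science quiz criterion depends on how many answer options each item had, which is not reported here. The sketch below therefore treats the number of options as a hypothetical parameter. It assumes that a participant who ignores the planted key matches it on each item with probability 1/n_options, independently across the 10 items. Under these assumptions, four options give a tail probability of roughly 2% and five options give roughly 0.6%, so the < 1% figure is sensitive to this detail.

```python
from math import comb

def binomial_upper_tail(k_min: int, n: int, p: float) -> float:
    """Probability that X >= k_min when X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

N_ITEMS = 10   # multiple-choice items on the science quiz
THRESHOLD = 6  # "six or more" answers matching the planted (incorrect) key

# Hypothetical: the number of options per item is not reported, so we vary it.
# Under blind guessing, each item matches the planted key with probability 1/n_options.
for n_options in (4, 5):
    p_match = binomial_upper_tail(THRESHOLD, N_ITEMS, 1 / n_options)
    print(f"{n_options} options: P(X >= {THRESHOLD}) = {p_match:.4f}")
```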

Processing Speed
We measured processing speed using the "Coding B" scale of the WISC-IV®, adapted for group administration as specified by Varma, Varma, Van Boekel, and Wang (2016). Participants were given a response sheet. A key at the top mapped the digits 1-9 to arbitrary symbols. Several rows of digits appeared at the bottom of the sheet, with an empty box below each digit. Participants had two minutes to translate as many digits into symbols as possible, working left to right and top to bottom. The dependent variable was the number of correctly translated symbols.

Working Memory
We measured WM using the "Forward Digit Span" scale of the WISC-IV®, adapted for group administration as specified by Varma et al. (2016). On each trial, the experimenter read aloud a sequence of digits at a rate of approximately one per second. Participants were then cued to reproduce the sequence from memory on their response sheet. There were two sequences of each length from 2 to 9. The dependent variable was the total number of correctly recalled digits. A limitation of this measure is that it taps only the storage component of WM, not the processing component (Baddeley & Hitch, 1974; Just & Carpenter, 1992). It was chosen because, in pilot testing, more conventional measures of WM such as complex span (Daneman & Carpenter, 1980; Turner & Engle, 1989) and backward digit span (from the WISC-IV®) proved difficult to administer in a group setting without widespread cheating.

Selective Attention
We measured selective attention using a paper-and-pencil version of the flanker task (Eriksen & Eriksen, 1974) adapted for group administration (Varma et al., 2016). Participants first completed a page of 32 neutral stimuli, each consisting of a central arrow pointing left or right, flanked on each side by two asterisks (e.g., "**→**"). "L" and "R" appeared below the leftmost and rightmost asterisks, respectively, and participants circled the letter indicating the direction of the central arrow. After completing the page, participants consulted a stopwatch projected at the front of the classroom and recorded their completion time. They then completed a page of 32 interference stimuli, in which the central arrow was flanked by arrows that either did (e.g., "→→→→→") or did not (e.g., "←←→←←") point in the same direction, and again recorded their completion time. The dependent variable was the interference completion time minus the neutral completion time.
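The flanker scoring described above reduces to a difference of two self-recorded completion times. The sketch below also computes a per-page error rate, which the Results use for trimming. The `FlankerPage` structure and its field names are hypothetical conveniences and are not part of the original materials.

```python
from dataclasses import dataclass

@dataclass
class FlankerPage:
    """One 32-item page; completion time is self-recorded from the projected stopwatch."""
    completion_seconds: float
    responses: list[str]  # letter circled for each item ("L" or "R")
    answers: list[str]    # direction of the central arrow for each item

    def error_rate(self) -> float:
        """Proportion of items on which the circled letter was wrong."""
        errors = sum(r != a for r, a in zip(self.responses, self.answers))
        return errors / len(self.answers)

def interference_cost(neutral: FlankerPage, interference: FlankerPage) -> float:
    """Dependent variable: interference completion time minus neutral completion time."""
    return interference.completion_seconds - neutral.completion_seconds

# Usage with made-up times: a participant who took 40 s on the neutral page
# and 52.5 s on the interference page has an interference cost of 12.5 s.
key = ["L", "R"] * 16
neutral = FlankerPage(40.0, list(key), key)
interference = FlankerPage(52.5, ["L"] * 32, key)
print(interference_cost(neutral, interference), interference.error_rate())
```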

Cognitive Flexibility
We measured cognitive flexibility using a version of the WCST shortened and adapted for group administration (Varma et al., 2016). Forty-eight stimuli were projected at the front of the classroom. Each showed five cards, one target and four standards labeled A-D; see Appendix A for an example stimulus. Cards varied along three dimensions (color, number, and shape), each with four levels. Participants judged which of the four standard cards the target card "was most similar to" by circling one of A-D on their response sheet. Feedback was then provided by masking all but the correct standard card. The rule shifted silently every eight cards, cycling twice through same color, same number, and same shape. We coded the most common measure of WCST performance, the number of perseverative errors, defined as errors immediately following a rule shift that result from applying the disconfirmed (previous) rule. We also coded one novel measure, the number of systematic errors, defined as errors immediately following a rule shift that result from applying another logically possible rule (i.e., not the disconfirmed rule). Whereas perseverative errors are "bad," systematic errors are "good," indicating deliberate search of the hypothesis problem space (Klahr & Dunbar, 1988).
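The coding scheme for perseverative and systematic errors can be made concrete as a small classifier. The sketch below rests on two assumptions that the text does not fully settle. First, "immediately following a rule shift" is read as the first trial of each new eight-card block. Second, a response that matches the target on the previous rule counts as perseverative even if it also matches on the remaining dimension. Cards are represented as dictionaries, and the function names are hypothetical.

```python
from collections import Counter

RULES = ("color", "number", "shape")
BLOCK_LENGTH = 8  # the sorting rule shifts silently every eight cards
N_TRIALS = 48     # six blocks = three rules x two cycles

def rule_for_trial(t: int) -> str:
    """Sorting rule in force on 0-based trial t (color, number, shape, repeated)."""
    return RULES[(t // BLOCK_LENGTH) % len(RULES)]

def classify_response(t: int, target: dict, chosen: dict) -> str:
    """Label one response as correct, perseverative, systematic, or other_error."""
    rule = rule_for_trial(t)
    if chosen[rule] == target[rule]:
        return "correct"
    # Assumption: only the first trial after a shift is coded for error type.
    if t == 0 or t % BLOCK_LENGTH != 0:
        return "other_error"
    previous = rule_for_trial(t - 1)
    # Applying the just-disconfirmed rule is a perseverative error (checked first).
    if chosen[previous] == target[previous]:
        return "perseverative"
    # Applying the remaining, still logically possible rule is a systematic error.
    if any(chosen[d] == target[d] for d in RULES if d not in (rule, previous)):
        return "systematic"
    return "other_error"

def count_error_types(trials) -> Counter:
    """trials: sequence of (target, chosen) card pairs in presentation order."""
    return Counter(classify_response(t, target, chosen)
                   for t, (target, chosen) in enumerate(trials))
```

For example, on trial 8 (0-based) the rule has just shifted from color to number. Choosing a standard that matches the target's color is then coded as perseverative, and choosing one that matches only its shape is coded as systematic.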

Implicit Theory of Intelligence
We measured implicit theory of intelligence by adapting the Dweck (2000) measure.We modified the three items to be specific to science and simplified the wording in consultation with the classroom teacher; see Appendix A. Higher scores indicate an incremental theory and lower scores an entity theory.

Self-Efficacy
To check that any effects reflected implicit theory of intelligence specifically rather than motivation more generally, we also measured self-efficacy. We adapted Dweck's (2000) measure of self-confidence in intelligence, modifying the items to be specific to science; see Appendix A. Higher scores indicate higher self-efficacy.

Scientific Reasoning
Existing measures of scientific reasoning in middle school children are based on Piagetian concepts (e.g., Lawson, 1978). We used a new measure of scientific reasoning (Varma et al., 2013) derived from an extensive review of the current psychological and educational literatures on scientific reasoning and designed to be consistent with the National Science Standards (NRC, 2012). It spans five facets of scientific reasoning:
1) Hypothesis generation: the ability to observe a situation or event, recognize the difference between existing understanding and what more needs to be learned, and clearly articulate a question that can direct an empirical investigation.
2) Hypothesis testing: the ability to design valid tests of a hypothesis that correctly identify and manipulate all relevant variables in order to produce empirical evidence that will allow one to answer questions.
3) Reasoning from evidence: the ability to interpret the results of an investigation and to draw justified inferences and/or conclusions based upon the data.
4) Providing explanations: the ability to coordinate theory and evidence to draw inferences about causal or statistical relations.
5) Coordinating theory and evidence: the ability to evaluate a theory in light of experimental outcomes, reconcile new evidence with prior beliefs, and (if required) revise one's theory and generate new predictions.
The measure consists of ten multiple-choice and explanation items adapted from research on scientific reasoning (i.e., Tschirgi, 1980) and released standardized tests (i.e., TIMSS, 1995, 2003); see Appendix B for example items.
Higher scores indicate better scientific reasoning.

Procedure
The scientific reasoning measure was administered by the classroom teacher during one class period; participants required approximately 20 minutes to complete it. All other measures were administered by experimenters during another class period and required approximately 45 minutes to complete. Measures were administered in a fixed sequence designed (based on pilot testing) to maximize participants' engagement. For example, participants were more engaged by the "race against the clock" nature of the processing speed measure and less engaged by the self-efficacy measure. A drawback of this design decision is that it may have introduced order effects.

Results
Because all measures were group administered, uninterested participants could disengage. We therefore excluded participants who met any of the following criteria: more than five errors on the digit-symbol substitution measure, more than 50% errors on the flanker task, or fewer than five errors on the WCST.³ The trimmed sample consisted of 94 participants (age: M = 13.36 years, SD = 0.28; 49 females, 45 males). The 29 participants who demonstrated hindsight bias on the prediction evaluation task or the science quiz formed the permissive reasoning group. The remaining 65 participants formed the strict reasoning group.⁴
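The trimming and grouping rules can be summarized as two predicates applied in sequence. The sketch below reads the exclusion criteria as disjunctive, so any one of them excludes a participant. It also assumes that the flanker error rate is pooled over both pages. The `Participant` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    coding_errors: int         # digit-symbol substitution (Coding B) errors
    flanker_error_rate: float  # proportion of flanker items wrong (assumed pooled over both pages)
    wcst_errors: int           # total WCST errors over the 48 trials
    coin_correct_claims: int   # coin-flip predictions self-evaluated as "correct" (of 20)
    quiz_key_answers: int      # quiz answers matching the planted incorrect key (of 10)

def is_disengaged(p: Participant) -> bool:
    """Exclusion criteria from the Results, read as 'any of' the three."""
    return p.coding_errors > 5 or p.flanker_error_rate > 0.5 or p.wcst_errors < 5

def reasoning_group(p: Participant) -> str:
    """Permissive if hindsight bias appears on either measure; otherwise strict."""
    permissive = p.coin_correct_claims > 15 or p.quiz_key_answers >= 6
    return "permissive" if permissive else "strict"

def classify_sample(participants: list[Participant]) -> list[tuple[Participant, str]]:
    """Drop disengaged participants, then assign each remaining one to a group."""
    return [(p, reasoning_group(p)) for p in participants if not is_disengaged(p)]
```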

Overall Differences Between Strict and Permissive Reasoners
We first evaluated whether the strict and permissive reasoning groups differed on cognitive ability, motivational orientation, or scientific reasoning, comparing them on each measure using independent-samples t-tests (Table 1). The groups did not differ on any of the measures (p > 0.30 for all). This is inconsistent with fallibilism, which equates strict reasoning with normative reasoning and therefore expects strict reasoners to have better scientific reasoning and cognitive ability scores than permissive reasoners. However, it is consistent with the proposed distinction between strict and permissive reasoning. In particular, we identified these groups using measures derived from the hindsight bias literature, and prior research has found no correlation between hindsight bias and intelligence as measured by Raven's Progressive Matrices (Pohl & Eisenhauer, 1995).
Table 1 note. a Two participants did not report their age; the degrees of freedom for this test were therefore 90.
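The text does not say whether the t-tests pooled the two groups' variances (Student) or used Welch's correction. A pooled test is consistent with the reported 90 degrees of freedom for the age comparison: 94 participants minus the 2 with missing age leaves 92, and 92 - 2 = 90. The standard-library sketch below therefore assumes a pooled test and returns only the t statistic and its degrees of freedom. A p-value would require the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=True)`.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Student's independent-samples t with pooled variance; returns (t, df)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # statistics.variance uses the n - 1 (sample) denominator.
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df

print(independent_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # t is about -3.674, df = 4
```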

Relational Differences Between Strict and Permissive Reasoners
We next evaluated the prediction that strict and permissive reasoners differ in the relation between cognitive ability and motivational orientation, on the one hand, and scientific reasoning, on the other. For each group, we regressed scientific reasoning on the cognitive and motivational variables using a hierarchical approach, first controlling for (1) age and then evaluating the additional predictive power provided by (2) basic cognitive ability, (3) complex cognitive ability, and (4) motivational orientation. The results are shown in Table 2. For strict reasoners, the basic cognitive variables and the motivational orientation variables each explained significant additional variance, and the final, full model accounted for 37.5% of the variance in scientific reasoning. For permissive reasoners, age and the motivational orientation variables each explained significant additional variance, and the final, full model accounted for 58.0% of the variance. To put these fits in context, only one previous educational psychology study has used cognitive measures to predict scientific reasoning ability in middle (and high) school students. Kwon and Lawson (2000) found that age, WCST, Tower of London, Group Embedded Figures, and "mental capacity" accounted for 56.1% of the variance on a measure of logico-scientific reasoning (Lawson, 1978).
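The hierarchical approach enters predictor blocks cumulatively and tracks how much R² increases at each step. The NumPy sketch below runs on synthetic data, not the study's data. Assigning processing speed, WM, and flanker cost to the "basic" block, the two WCST error counts to the "complex" block, and implicit theory and self-efficacy to the "motivational" block is our reading of the Measures section. The function names are hypothetical. `f_change` computes the usual F statistic for an R² increment, but not its p-value.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit of y on X plus an intercept."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def hierarchical_r2(blocks, y):
    """Enter predictor blocks cumulatively; return (R^2, delta R^2) for each step.

    Each block is a 1-D array (one predictor) or an (n, k) array (k predictors).
    """
    n = len(y)
    columns, steps, previous_r2 = [], [], 0.0
    for block in blocks:
        columns.append(np.asarray(block, dtype=float).reshape(n, -1))
        r2 = r_squared(np.hstack(columns), y)
        steps.append((r2, r2 - previous_r2))
        previous_r2 = r2
    return steps

def f_change(r2_reduced, r2_full, k_added, n, p_full):
    """F for the R^2 increase from adding k_added predictors; p_full excludes the intercept."""
    return ((r2_full - r2_reduced) / k_added) / ((1.0 - r2_full) / (n - p_full - 1))

# Synthetic illustration only (not the study's data).
rng = np.random.default_rng(0)
n = 65
age = rng.normal(13.4, 0.3, n)
basic = rng.normal(size=(n, 3))       # e.g., processing speed, WM, flanker cost
complex_ = rng.normal(size=(n, 2))    # e.g., perseverative and systematic errors
motivation = rng.normal(size=(n, 2))  # e.g., implicit theory, self-efficacy
y = 0.5 * basic[:, 0] + 0.4 * motivation[:, 0] + rng.normal(size=n)
steps = hierarchical_r2([age, basic, complex_, motivation], y)
for step, (r2, delta) in enumerate(steps, start=1):
    print(f"step {step}: R^2 = {r2:.3f}, delta R^2 = {delta:.3f}")
```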
Table 2. Hierarchical multiple regressions predicting scientific reasoning for the strict and permissive groups.
The final, full regression models for each group are shown in Table 3. For strict reasoners, better scientific reasoning was associated with faster processing speed. This is consistent with the theoretical proposal that strict reasoners prefer instant rationality: for this group, faster processing enables more efficient evaluation of hypotheses against evidence and dismissal of those that are disconfirmed. Better scientific reasoning was also associated with a more incremental theory of intelligence, a finding we return to in the Discussion. For permissive reasoners, the results were more complex, and more interesting. Better scientific reasoning was associated with slower processing speed, higher WM capacity, and more systematic errors on the WCST. These findings are consistent with the theoretical proposal that permissive reasoners prefer delaying rationality: they are willing to retain hypotheses in the face of disconfirming evidence, at least in the short run. When reasoners are confronted with disconfirming evidence, higher WM capacity and more systematic search of the hypothesis problem space enable a more thorough evaluation of the disconfirmed hypothesis and consideration of alternative hypotheses. For this group, faster processing might be a detriment, leading to more impulsive decision-making.
Better scientific reasoning was also associated with more perseverative errors. Although this finding was not predicted a priori, it can be interpreted as a consequence of delaying rationality. Retaining a disconfirmed hypothesis is the wrong choice when the disconfirming evidence is in fact veridical, as is the case for perseverative errors in the WCST. However, it is the correct choice when the disconfirming evidence represents a Type I error, as is sometimes the case in scientific reasoning (Simmons, Nelson, & Simonsohn, 2011). Finally, better scientific reasoning was associated with a more incremental theory of intelligence, a finding we return to in the Discussion.

Discussion
In this study we proposed the existence of two groups: strict reasoners, who prefer instant rationality, and permissive reasoners, who prefer delayed rationality. We identified the permissive reasoners behaviorally by their willingness to engage in hindsight bias on a prediction evaluation task or a science quiz, which we reconceptualized as a willingness to retain or revise hypotheses given disconfirming evidence. Fallibilism expects permissive reasoners to be worse scientific reasoners than strict reasoners. Indeed, Popper (1963) famously dismissed Freudianism and Marxism because of the permissive reasoning of their advocates, that is, their willingness to retain these theories in the face of disconfirming evidence. Fallibilism also expects permissive reasoners to be of lower cognitive ability than strict reasoners. These expectations were not supported: the two groups did not differ on any of the scientific reasoning, cognitive ability, or motivational orientation measures.
By contrast, the theoretical proposal that strict reasoners prefer instant rationality and permissive reasoners prefer delayed rationality was supported by the finding that group membership modulated the relation between cognitive and motivational factors, on the one hand, and scientific reasoning, on the other. The proposal that strict reasoners prefer instant rationality was supported by the finding that better scientific reasoning was associated with faster processing speed. For this group, faster processing enables more fluent evaluation of hypotheses against evidence, freeing limited cognitive resources for higher-order scientific inference.
The proposal that permissive reasoners prefer delayed rationality was supported by the finding that better scientific reasoning was associated with slower processing, greater WM capacity, and more systematic errors on the WCST. We interpret this pattern as evidence of a speed-accuracy trade-off. More deliberate permissive reasoners cautiously retain and carefully revise hypotheses in the face of disconfirming evidence, whereas their more impulsive counterparts breeze past it without fully considering its implications, resulting in worse scientific reasoning. This is consistent with the recent demonstration that slower data presentation rates enable people to defer hypothesis evaluation (Lange, Thomas, Buttaccio, Illingworth, & Davelaar, 2013).
That better scientific reasoning was associated with more perseverative errors on the WCST was surprising, given that such errors are associated with lesions to prefrontal cortex (Milner, 1963), with lower executive function and psychometric intelligence (Miyake et al., 2000), and with worse logico-scientific reasoning in middle and high school students (Kwon & Lawson, 2000). We offered a post hoc interpretation of this finding above, which we elaborate here. Scientific reasoning is probabilistic. Disconfirming evidence usually indicates that a hypothesis is incorrect, but it can also represent a Type I error. In the latter case, retaining a disconfirmed hypothesis in the short term (i.e., perseverating) is adaptive, a form of replication indicative of deliberate thinking (Simmons et al., 2011). By contrast, the WCST is deterministic. Perseveration is never adaptive; a logical error is, and will continue to be, a logical error. Thus, making perseverative errors can be associated with better performance on more probabilistic scientific reasoning tasks, even if it is associated with worse performance on more deterministic logical tasks.
For both strict and permissive reasoners, better scientific reasoning was associated with a more incremental implicit theory of intelligence. This finding does not follow specifically from our definitions of strict and permissive reasoners. Rather, we interpret it as an instance of the more general finding in the motivation literature that people with incremental theories respond more adaptively to failure than people with entity theories, and that this is associated with better performance in domains such as mathematics (Blackwell et al., 2007). The current study extends this finding to the domain of science.

Limitations and Future Directions
The current study has a number of limitations that should be addressed in future research. The permissive reasoning group (29 participants) was rather small given the number of independent variables in the regression analyses, so it is important to replicate this study with a larger sample.
One question for future research is the role of inhibition in scientific reasoning. Prior research has shown that when correct scientific theories are learned, they do not supplant incorrect naïve beliefs but rather suppress them during scientific reasoning, especially under speeded conditions (Goldberg & Thompson-Schill, 2009; Kelemen & Rosset, 2009; Knobe & Samuels, 2013; Shtulman & Valcarcel, 2012). This suggests a central role for inhibition when theoretical assumptions are evaluated against disconfirming evidence. Given that we identified strict versus permissive reasoners using measures derived from the hindsight bias literature, and given the association between hindsight bias and inhibition (Bernstein et al., 2007), it is possible that strict and permissive reasoners use inhibition differently. Strict reasoners might use inhibition to suppress a disconfirmed hypothesis in the presence of disconfirming evidence, so that they can consider alternate hypotheses. By contrast, permissive reasoners might use inhibition to suppress disconfirming evidence so that a disconfirmed hypothesis can be maintained alongside alternate hypotheses. Future research should evaluate these predictions and, more generally, the differential roles of inhibition for strict versus permissive reasoners.
This study focused on individual differences in the processes involved in scientific reasoning (e.g., Klahr & Dunbar, 1988). Accordingly, our measures were knowledge-lean. Further research is needed to extend our findings to knowledge-rich contexts, in which participants reason scientifically about scientific concepts. Relevant here are cognitive development studies of the relationship between conceptual knowledge and patterns of reasoning. For example, Howe and colleagues have investigated how elementary and middle school children reason about physics concepts (e.g., Howe, Taylor Tavares, & Devine, 2014). Their work distinguishes deliberate engagement with explicit concepts from less reflective engagement that draws on tacit knowledge.
Another question for future research is whether strict and permissive reasoners differ on non-cognitive dimensions. We identified these two groups using measures derived from the hindsight bias literature. Some theories of hindsight bias stress the role of non-cognitive factors such as motivated sense-making (Pezzo & Pezzo, 2007) and metacognitive surprise (Müller & Stahlberg, 2007). Future research should investigate the relative contributions of cognitive and non-cognitive factors to strict versus permissive reasoning.
The current work has implications for the development of scientific reasoning. Better scientific reasoning was associated with cognitive abilities that improve over development: processing speed for strict reasoners, and WM and WCST performance for permissive reasoners. Whether these associations change over development could be addressed in a cross-sectional or longitudinal study. Whether a child's membership in the strict or permissive group is stable over development or fluctuates could be addressed in a longitudinal study.
The current work also has implications for the developmental neuroscience of scientific reasoning and of cognitive abilities more generally (Kwon & Lawson, 2000). Neuroimaging studies of children and adults have established prefrontal cortex as a neural correlate of WM (Darki & Klingberg, 2015), inhibition (Rueda et al., 2004), and WCST performance (Konishi et al., 2008). Prefrontal cortex is also a neural correlate of reasoning about disconfirming evidence in logical (Goel & Dolan, 2003) and scientific (Fugelsang & Dunbar, 2005) contexts. Finally, prefrontal cortex is recruited as people deepen their understanding of physical systems (Mason & Just, 2015). One question, then, is whether strict and permissive reasoners rely on different brain areas when reasoning about disconfirming evidence, reflecting the different cognitive processes documented in the current study. For example, WM capacity was positively associated with scientific reasoning for permissive reasoners (but not for strict reasoners). The prediction, then, is that permissive reasoners (but not strict reasoners) might show greater recruitment of lateral prefrontal cortex (PFC) areas associated with WM when reasoning about disconfirming evidence.
Addressing these questions will set the stage for future research on science education. One goal is to evaluate whether group differences in strict versus permissive reasoning (and individual differences in cognitive ability and motivational orientation) predict science achievement as measured by standardized tests. Another goal is to design instruction that encourages children to reason more permissively about disconfirming evidence, going beyond the current science standards, in which fallibilism is the norm (NRC, 2012). The result would be a shift in emphasis from hypothesis evaluation to theory development (Chinn & Brewer, 1993).

Conclusion
The most important contribution of the current research is the proposal that people who reason permissively about disconfirming evidence are different from, not worse than, people who reason strictly. We identified permissive reasoners in a novel way, by recasting hindsight bias as a behavioral indicator of comfort with contradiction rather than as a deviation from rationality and normativity, as it is viewed from the fallibilist perspective on science (Popper, 1963) and in the reasoning and decision-making literatures (Fischhoff, 1975; Nickerson, 1998; Slovic & Fischhoff, 1977). Permissive reasoners sometimes use disconfirming evidence to revise existing theoretical assumptions so that they better approximate reality, and sometimes set such evidence aside pending future replication. Their delayed rationality, necessary for theory development, balances the instant rationality prized by experimentalists.

Table 1. Independent t-tests comparing the strict and permissive groups on all variables

Table 3. Final (step 4) multiple regression models predicting scientific reasoning for the strict and permissive groups.