Kirk Allen Defense

The Statistics Concept Inventory: The Development and Analysis of a Cognitive Assessment Instrument in Statistics
Dissertation Defense
Kirk Allen
May 2, 2006
Organization
Book One
Creation of the SCI
Book Two
Expanding, Doing more with the data
Book Three
Re-validating
Book Four
Summarize, Speculation on the future
My personal timeline
Fall 2002: started grad school
Summer/Fall 2004: decided to go straight for Ph.D.
Fall 2005: General exams
Spring 2006: Taking my final class
Spring 2006: Graduating!
Background
Statistics Concept Inventory (SCI) project began in Fall 2002
Based on the format of the Force Concept Inventory (FCI)
Shifts focus away from problem solving, which is the typical classroom format
Focus on conceptual understanding
Multiple choice, around 30 items
Force Concept Inventory

Focuses on Newton's three laws and related concepts
Scores and gains on initial testing much lower than expected
Led to evaluating teaching styles
Interactive engagement found to be most effective at increasing student understanding
Other Concept Inventories

Many engineering disciplines are developing concept inventories
e.g., thermodynamics, circuits, materials, dynamics, statics, systems & signals
Foundation Coalition
http://www.foundationcoalition.org/home/keycomponents/concept/index.html
Book One
The process of creating the SCI
I would have defended this as my Master's thesis
Traditional (approximately), five-chapter format

1. Introduction (short)
2. Test Theory
The methods that were used in creating the SCI
3. Concept inventories
Descriptions of other work along similar lines
4. Methods and Results

Combined because Methods is short
5. Preliminary conclusions (short)
Results Spring 2005

Course     Level            Mean, pre   Mean, post   SD, post
Quality    Junior IE        44.9%       --           13.4% (pre)
Engr       Intro, calc      40.7%       44.9%        14.6%
Math #1    Intro, calc      46.8%       44.0%        14.4%
Math #2    Intro, calc      48.5%       45.6%        14.4%
External   Intro, calc      --          49.8%        13.8%
Psych      Intro, algebra   38.6%       43.9%        10.9%
Reliability (Spring 2005)

Course Quality Pre-Test Alpha 0.7084 Post-Test Alpha --
Engr
Math #1
0.6619
0.6071
0.7744
0.7676
Math #2
Psych
0.7640
0.4284
0.7079
0.5918
Content Validity
Content validity refers to the extent to which items are (1) representative of the knowledge base being tested and (2) constructed in a sensible manner (Nunnally)
Focus groups ensure that the question is being properly interpreted and help develop useful distracters
Content Validity
Faculty survey: statistics topics were rated for their importance to the faculty; helps provide a list of which topics to include on the SCI
AP Statistics course outline also consulted for topic coverage
Gibbs criteria identify poorly written questions
Concurrent Validity
For Spring 2004
Three courses: 1 Engr, 2 Math

Course           SCI Pre                 SCI Post                  SCI Gain                SCI Norm. Gain
Engr (n=29)      r = 0.060 (p = 0.758)   r = 0.133 (p = 0.493)     r = 0.080 (p = 0.679)   r = 0.108 (p = 0.578)
Math #1 (n=30)   r = 0.323 (p = 0.081)   r = 0.502** (p = 0.005)   r = 0.316 (p = 0.089)   r = 0.353 (p = 0.056)
Math #2 (n=26)   r = 0.219 (p = 0.282)   r = 0.384 (p = 0.053)     r = 0.303 (p = 0.133)   r = 0.336 (p = 0.094)
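The "Norm. Gain" column is not defined on the slide. A minimal sketch, assuming it follows the normalized gain commonly used in the FCI literature (gain divided by the maximum possible gain), with scores expressed as percentages. The function name and the sample scores are hypothetical.

```python
def normalized_gain(pre_pct, post_pct):
    """Normalized gain: actual gain divided by the maximum possible gain.

    Assumes scores are percentages on a 0 to 100 scale (an assumption,
    since the slide does not define the quantity).
    """
    if pre_pct >= 100:
        raise ValueError("normalized gain is undefined for a perfect pre-test")
    return (post_pct - pre_pct) / (100.0 - pre_pct)


# Hypothetical student: 40% on the pre-test, 70% on the post-test.
print(normalized_gain(40, 70))  # half of the available improvement was realized
```

A raw gain of 30 points counts for more when the pre-test score is already high, which is why the two gain columns can correlate differently with an outside measure.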
Construct Validity
Three-factor and four-factor FIML with general factor
Descriptive, inferential, probability, and graphical sub-tests
Graphical a priori grouped with Descriptive in 3-factor Confirmatory Model
Overall results: Item Uniqueness 70.1% and 70.4%
Preference is for four-factor model because graphical items are a separate sub-test
More on this later!
Item Discrimination Index

Compares top quartile to bottom quartile on each item
Generally around 1/3 of the items fall into each of the ranges: poor (< 0.20), moderate (0.20 to 0.40), and high (> 0.40)
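A minimal sketch of the discrimination index described above, assuming quartiles are formed by sorting students on total score and taking the top and bottom n/4. The function names and the data are hypothetical, and ties at the quartile boundary are broken by sort order.

```python
def discrimination_index(item_correct, total_scores):
    """Proportion correct in the top quartile minus the bottom quartile."""
    n = len(total_scores)
    k = max(1, n // 4)  # quartile size (assumption: floor of n/4, at least 1)
    order = sorted(range(n), key=lambda i: total_scores[i])
    bottom, top = order[:k], order[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom


def classify(d):
    """Label an index using the ranges quoted on the slide."""
    if d < 0.20:
        return "poor"
    if d <= 0.40:
        return "moderate"
    return "high"


# Hypothetical class of 8 students: total scores and 0/1 results on one item.
totals = [5, 9, 12, 14, 17, 20, 24, 28]
item = [0, 0, 1, 0, 1, 1, 1, 1]
d = discrimination_index(item, totals)
print(d, classify(d))
```

An item that the strongest and weakest students answer equally well gets an index near zero, and a negative index means the weakest students did better on it.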
Item Analysis
Discrimination index
Alpha-if-deleted
Reported by SPSS or SAS
Shows how overall alpha would change if that one item were deleted
Answer distribution
Try to eliminate or improve choices which are consistently not chosen
Focus group comments
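Alpha-if-deleted can be reproduced outside SPSS or SAS. The sketch below assumes the standard Cronbach's alpha formula applied to a students-by-items matrix of 0/1 scores. The function names and the data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-student item-score lists."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(rows):
    """Alpha recomputed with each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for c, v in enumerate(r) if c != j] for r in rows])
        for j in range(k)
    ]


# Hypothetical data: 6 students, 4 items; the last item tracks the others poorly.
data = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
print(cronbach_alpha(data))
print(alpha_if_deleted(data))
```

An item whose alpha-if-deleted is higher than the overall alpha is lowering reliability, which makes it a candidate for revision or removal.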
Understanding p-values
A researcher performs a t-test to test the following hypotheses:
H0: (relation not recoverable from the slide)
H1: (relation not recoverable from the slide)
He rejects the null hypothesis and reports a p-value of 0.10. Which of the following must be correct?
a) The test statistic fell within the rejection region at the significance level
b) The power of the test statistic used was 90%
c) Assuming H0 is true, there is a 10% possibility that the observed value is due to chance **
d) The probability that the null hypothesis is not true is 0.10
e) The probability that the null hypothesis is actually true is 0.9
Results for 4 classes

             Pre #1   Post #1     Pre #2   Post #2      Pre #3   Post #3      Pre #4   Post #4
Choice a%    15%      41%         32%      52%          5%       67%          17%      18%
Choice b%    16%      18%         14%      20%          14%      0%           6%       9%
Choice c%    41%      35% (-6%)   41%      15% (-24%)   62%      27% (-35%)   47%      42% (-5%)
Choice d%    18%      6%          14%      12%          19%      7%           19%      24%
Choice e%    2%       0%          0%       0%           0%       0%           11%      6%
Analysis
Discrimination
Pre: 0.25, -0.17, 0.52, 0.15
Post: 0.00, -0.14, 0.25, 0.33
P-value question
Problems?
Too definitional
p-value taught from an interpretive standpoint
when to reject or not reject the null hypothesis
Therefore
New question
(not a replacement)
An engineer performs a hypothesis test and reports a p-value of 0.03. Based on a significance level of 0.05, what is the correct conclusion?
a) The null hypothesis is true.
b) The alternate hypothesis is true.
c) Do not reject the null hypothesis.
d) Reject the null hypothesis. **
Results of New Question

Discrimination better (post-test)
0.20, 0.29, 0.75, 0.12
Still not great overall (0.19)
Percent correct and gains low

Post-test % correct (gain +/-%)
6% (-17%), 20% (-3%), 33% (+4%), 19% (+10%)
Moving on
Similar analyses were conducted for all items, a sort of bottom-up approach to developing the test
For right or wrong, the test has changed very little since Spring 2004
No need to continually repeat the item analysis tables with such fine detail
Let's see what else we can do with the SCI!
Exploring Reliability
Results (older, also presented during Proposal)
Demonstrated strong relationships between:
Alpha-if-deleted (a measure of item reliability)
Discrimination index
Gap: the average total score of students who answered an item correctly minus the average total score of students who answered it incorrectly
Gap = Mean(correct) - Mean(incorrect)
Focus groups tell us that guessing or using test-taking tricks are valid causal agents for poor item statistics
This meshes with theory because you would expect these items to have a Gap of zero, which lowers total-score variance
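A minimal sketch of the Gap statistic as defined on this slide. The function name and the data are hypothetical. The second item mimics a guessed item, where the students who got it right are no stronger overall than those who got it wrong.

```python
from statistics import mean


def gap(item_correct, total_scores):
    """Mean total score of correct responders minus that of incorrect responders."""
    right = [t for c, t in zip(item_correct, total_scores) if c == 1]
    wrong = [t for c, t in zip(item_correct, total_scores) if c == 0]
    if not right or not wrong:
        return None  # undefined when everyone (or no one) answered correctly
    return mean(right) - mean(wrong)


totals = [10, 20, 30, 40]            # hypothetical total scores
well_behaved_item = [0, 0, 1, 1]     # stronger students answer correctly
guessed_item = [1, 0, 0, 1]          # correctness unrelated to total score

print(gap(well_behaved_item, totals))
print(gap(guessed_item, totals))
```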
Online Test
From Proposal Defense
Comparing online vs. paper
Differences found
9 items
Sub-test (except Probability) and overall scores
Reliability: Probability and Inferential
Problem with study: nearly all paper students at one university, which had very good overall results
Differences are still not large
Other findings
Order effects
No systematic bias in question order
No correlation between percent correct and order position
Small (but significant) downward trend in answer confidence
Only about 5% of the total rating scale from beginning of test to the end
Round 2
Spring 2006, pre-test
Two sections of same course, taught by same professor, took SCI on the same day
One paper (n=14), one online (n=16)
Very similar demographics
Comparative results (next slide)
Interesting finding
Online: time and number correct inversely related
Paper: opposite
Not rigorously assessed
[Table: online vs. paper comparison for Fall 2005 and Spring 2006; cell values not recoverable]
Measures listed: Reliability (Probability, Inferential); Mean and Variance (Total, Descriptive, Inferential, Graphical, Probability)
Items listed: #1 (Probability), #2 (Inferential), #3 (Descriptive), #7 (Graphical), #9 (Descriptive), #15 (Descriptive), #18 (Inferential), #19 (Inferential), #21 (Probability), #22 (Inferential), #28 (Graphical), #35 (Inferential), #36 (Inferential)
Problems
Fall 2005
Confounding with university
Spring 2006
Pre-test Small sample size
The Problem with Educational Research
Statistical Power
Rigour / Control
Chapter 8
Part A: Lit review
Background on difficulties
Attitudes
Reasoning skills: Probability
Kahneman & Tversky
Reasoning skills: Statistics
Some teaching strategies
Chapter 8
Part B: Confidence on the SCI
Original?
The reviewed studies are generally very specific and in-depth on a certain topic
Or, they are very general as to why students have difficulties (e.g., attitudes)
Nothing which provides a broad comparison identifying conceptual difficulties across statistics
So use the SCI to do this.
Chapter 8
Method
After students answer each question for the online SCI, the following is presented to them.
Results: big picture
Results: sample item

Which would be more likely to have 70% boys born on a given day: A small rural hospital or a large urban hospital?
a) Rural
b) Urban
c) Equally likely
d) Both are extremely unlikely
Rank 10th in correct (low)
Rank 25th in confidence (high)
Students are over-confident
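The reasoning behind the keyed answer can be checked with a binomial tail probability. This is a sketch under stated assumptions: each birth is a boy with probability 0.5, "70% boys" is read as "at least 70%", and the hospital sizes (10 and 100 births per day) are hypothetical because the item gives no numbers.

```python
from math import ceil, comb


def prob_at_least_share(n, share=0.7, p=0.5):
    """P(at least `share` of n births are boys) under a binomial model."""
    k_min = ceil(share * n - 1e-9)  # small tolerance guards against float round-off
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


rural = prob_at_least_share(10)    # hypothetical small hospital
urban = prob_at_least_share(100)   # hypothetical large hospital
print(rural)   # 0.171875
print(urban)   # far smaller than the rural value
```

Extreme proportions are much more likely in small samples, which is the point students miss when they choose "Equally likely".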
Results: the graphs
Results: comparison
Kahneman & Tversky studied a very similar problem as part of the representativeness misconception of probability
Subjects do not appreciate that large samples are more likely to be representative of the population
20% correct, 56% equally likely
Smaller N of subjects, also inexperienced
SCI online: 37% correct, 45% equally likely
Chapter Nine
Questions
Is the SCI a multi-dimensional test?
How does this compare to such a broad field as statistics?
What implications can be drawn from the answers to the above?
Reliability
Common measure: Cronbach's alpha
Alpha is an under-estimate of reliability for a multidimensional test
Other measures account for this
Theta: based on largest eigenvalue from a principal component analysis
Omega: based on communalities from a factor analysis, thus depends on the number of factors
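A minimal sketch of theta, assuming the usual form theta = (p / (p - 1)) * (1 - 1 / lambda_1), where p is the number of items and lambda_1 is the largest eigenvalue of the item correlation matrix. It uses NumPy; the function name and the simulated responses are hypothetical, and omega is left out because it needs communalities from a fitted factor model.

```python
import numpy as np


def theta_reliability(data):
    """Theta from the largest eigenvalue of the item correlation matrix."""
    data = np.asarray(data, dtype=float)
    p = data.shape[1]
    corr = np.corrcoef(data, rowvar=False)
    lam1 = np.linalg.eigvalsh(corr)[-1]  # eigvalsh returns ascending eigenvalues
    return p / (p - 1) * (1 - 1 / lam1)


# Hypothetical responses: 200 students, 6 items driven by one common ability.
rng = np.random.default_rng(0)
ability = rng.standard_normal((200, 1))
responses = (ability + rng.standard_normal((200, 6)) > 0).astype(int)
print(theta_reliability(responses))
```

When all items are perfectly correlated the largest eigenvalue equals p and theta is 1; when items are uncorrelated it equals 1 and theta is 0.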
[Figure: reliability coefficients, x-axis from 1 to 38, y-axis from 0.75 to 1.00. alpha = 0.7650; theta = 0.8123; omega labels on the plot: 0.798, 0.8081, 0.8446, 0.8907]
Results indicate multi-dimensionality
But is it meaningful?
Exploratory Factor Analysis (EFA)

Many decisions to be made
Extraction method
Number of factors
Factor loadings
Rotation method
Simple structure
Each variable ideally loads along exactly one factor
Minimize number of variables per factor
Not the best paradigm

Other concept inventories have done it
Curiosity ("I wonder what all those options in SPSS mean.")
Decisions
Extraction method
Principal components chosen
Not ideal: maximizes extracted variance
Does not optimize the prediction of the overall correlation structure
Quick comparison of PC vs. ML

First-factor loadings from a four-factor solution had a mean absolute-difference of 0.030 (small)
Decisions
Number of factors
Eigenvalues > 1
Fifteen factors
Scree plot (next slide)

One? Four? Nine?
Parallel analysis
Compare eigenvalues to random data
One or four
[Figure: scree plot, x-axis from 1 to 38]
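The eigenvalue rule and parallel analysis can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the SCI: the random reference data are standard normal, the threshold is the mean simulated eigenvalue (a 95th percentile is another common choice), and the function name and simulated data set are hypothetical.

```python
import numpy as np


def parallel_analysis(data, n_sims=200, seed=0):
    """Count leading eigenvalues that exceed the mean eigenvalue of random data."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rng = np.random.default_rng(seed)
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        random_eigs[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = random_eigs.mean(axis=0)
    exceeds = observed > threshold
    # Index of the first eigenvalue that fails the comparison = factors retained.
    n_factors = p if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold


# Hypothetical data: 300 students, 8 items sharing a single common factor.
rng = np.random.default_rng(1)
g = rng.standard_normal((300, 1))
items = 0.7 * g + 0.7 * rng.standard_normal((300, 8))

n_factors, observed, threshold = parallel_analysis(items)
kaiser = int((observed > 1).sum())  # the "eigenvalues > 1" rule
print(n_factors, kaiser)
```

With many items, several eigenvalues of pure noise exceed 1 by chance, so the eigenvalue rule tends to suggest more factors than parallel analysis does.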
Decisions
Assigning items to factors
How large does the factor loading need to be for the variable to be assigned to a given factor? Investigated 0.1 to 0.5
[Figure: y-axis from 0 to 35, x-axis factor loading cutoff from 0.1 to 0.5]
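One way to compare cutoffs is to count the items that meet the simple-structure ideal, meaning exactly one loading at or above the cutoff. The sketch below assumes that reading of the slide. The function name and the loading matrix are hypothetical.

```python
def count_simple_items(loadings, cutoff):
    """Number of items with exactly one absolute loading at or above the cutoff."""
    return sum(
        1 for row in loadings if sum(abs(x) >= cutoff for x in row) == 1
    )


# Hypothetical loadings for 4 items on 2 factors.
loadings = [
    [0.55, 0.10],
    [0.35, 0.32],
    [0.05, 0.45],
    [0.15, 0.12],
]
for cutoff in (0.1, 0.3, 0.5):
    print(cutoff, count_simple_items(loadings, cutoff))
```

A low cutoff assigns items to several factors at once and a high cutoff leaves items unassigned, so the count peaks somewhere in between.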
Decisions
Rotation
Orthogonal
Next slide: five-factor solution
Oblique
Involves extra parameters depending on method
Following slide: promax rotation, with parameter Kappa
[Figure: y-axis from 0 to 35; categories Unrotated, Equamax, Quartimax, Varimax]
[Figure: y-axis from 0 to 35; promax Kappa values 2, 3, 4, 5, 6, 8]
Decisions
Number of factors
Four (unrotated) and five (rotated) best approximate simple structure
One-dimensional structure is most likely, based on scree plot and parallel analysis
Factor loadings
Values around 0.32 best; use 0.30 for simplicity
Rotation
Orthogonal: Varimax
Oblique: Promax with Kappa = 3
Conclusions
Items generally do not group in a meaningful way
But, some pairs of highly similar items grouped along the same factors
What now?
Confirmatory Factor Analysis (CFA)

Presumes the analyst has a pre-conceived notion of the underlying structure
"It's probably not wise to write a test to assess a domain that you don't have a map of."
Decisions are made a priori, with model comparisons more formal
Models
1. Uni-dimensional
2. Capture finer clustering of similar items
3. Prior work concluded general factor plus four sub-topics
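[Diagram, Model 1: general factor Statistics (G) loads on items Q1 to Q38 with weights w1 to w38; each item has an error term e1 to e38]
[Diagram, Model 2: as Model 1, plus a factor f1 (Test statistics) loading on Q2 and Q36 with weights s2 and s36]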
Model Fit
Overall fit (chi-square)
H0: model fits
Function of sample size
Nearly always reject the null in practice
Fit indices
Alternate way to assess fit
Too many to name
Results
Model   Chi-sq   d.f.   p        GFI      PGFI
(1)     785      665    0.0009   0.8805   0.8329
(2)     771      659    0.0017   0.8828   0.8275
(3)     682      617    0.0355   0.8952   0.7857
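The loss of parsimony shows up in the PGFI column. The reported values are consistent with GFI scaled by the ratio of model degrees of freedom to those of a null model with 38 * 37 / 2 = 703 degrees of freedom. That is an inference from the numbers on the slide, not a statement of how the software computed them, and the function name is hypothetical.

```python
def pgfi(gfi, df_model, n_items=38):
    """Parsimony-adjusted GFI: GFI times df_model / df_null.

    Assumes df_null = n_items * (n_items - 1) / 2, which reproduces the
    values on the slide.
    """
    df_null = n_items * (n_items - 1) // 2
    return gfi * df_model / df_null


# (GFI, d.f.) pairs as reported for models (1), (2), and (3).
reported = {1: (0.8805, 665), 2: (0.8828, 659), 3: (0.8952, 617)}
for model, (gfi, df) in reported.items():
    print(model, round(pgfi(gfi, df), 4))
# 1 0.8329
# 2 0.8275
# 3 0.7857
```

Model (3) has the highest GFI but spends 48 more parameters than model (1), so its PGFI is the lowest of the three.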
Conclusions
One-factor (1) is most parsimonious
(2) not appreciably worse
Model is too sparse based on current SCI
(3) provides best overall fit, but with noticeable loss of parsimony
Different data
Different methods
I'm not throwing it out; still there for those so inclined
Problems
Used regular correlation instead of tetrachoric
Normality violated
Sample size too small
Literature indicates these problems will affect the absolute magnitude of estimates but not the relative magnitudes
So it's ok for what I did.
Proposal
[Diagram: Statistics linked to Test Statistics, Standard Deviation, Correlation, Confidence Intervals, and p value]
Proposal
Could resemble original four proposed topic areas
"You're gonna need a bigger boat."
Reliability Revisited
Based on the preferred uni-dimensional model, shortening the current SCI seems reasonable
Use objective criteria
Discriminatory index
Alpha-if-deleted
Communalities
Strong correspondence between metrics
Reliability Revisited
By alpha-if-deleted, 23 items is optimal length
Selected 25 as my preferred length due to correspondence between metrics and because it's a nice round number
Cross-validation indicates a shorter SCI maintains the overall reliability
Full: 0.7650
Cut: 0.7655 (simulated based on 23-item SCI)
Chapter 10
Re-assess Content Validity
Interviews
Faculty survey
Interviews
Prefer interviews over focus groups because SCI does not involve group decision-making
Informal approach
IE grad students
Experienced statistics students
Not hand-cuffed by pre/post timetable
Interviews
Sample item (retained)
Text is ok, as opposed to symbols
Un-anticipated approach
B is the more conservative test
Incorrect reasoning
D is what is good for the company
A bottling company believes a machine is under-filling 20-ounce bottles. What will be the alternate hypothesis to test this belief?
a) On average, the bottles are being filled to 20 ounces.
b) On average, the bottles are not being filled to 20 ounces.
c) On average, the bottles are being filled with more than 20 ounces.
d) On average, the bottles are being filled with less than 20 ounces.

A coin of unknown origin is flipped twelve times in a row, each time landing with heads up. What is the most likely outcome if the coin is flipped a thirteenth time?
a) Tails, because even though for each flip heads and tails are equally likely, since there have been twelve heads, tails is slightly more likely
b) Heads, because this coin has a pattern of landing heads up
c) Tails, because in any sequence of tosses, there should be about the same number of heads and tails
d) Heads and tails are equally likely
Interviews
Deleted item
Context of coins seems ingrained
50/50 always; fixated
But probability has to be different
Still answered D though!
Consideration of control!
But erred for gambler's fallacy
Recommend new context for this item
Faculty Survey
Rated the importance of 87 statistics topics on 1 to 4 scale
24 participants
IE faculty listserv and emailed SCI contacts
Not at OU
Compared with previous survey conducted at OU
Results
Generally strong correspondence between old and new surveys
Correlation: 0.69 (ranks), 0.67 (numbers)
Scales differ
New median 2.95, old median 2.61
Consider two surveys in tandem, using ranks
Based on 25 retained items
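Because the two surveys use scales with different medians, comparing ranks is the safer choice. A minimal sketch of a rank correlation (Spearman, with average ranks for ties), written in plain Python. The function names and the topic ratings are hypothetical, not the survey data.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks in ascending order, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean importance ratings for five topics on each survey.
old_survey = [3.4, 2.1, 2.9, 1.8, 2.6]
new_survey = [3.7, 2.8, 3.0, 2.2, 3.1]
print(spearman(old_survey, new_survey))
```

Ranks ignore the shift in scale between surveys, so a uniformly more generous set of raters does not change the result.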
Results
16 topics ranked in Top 25 on both surveys
14 of these are covered
Very good!
9 topics in Top 25 of new but not old

Only 2 topics covered
Not so good
Exactly the same for old (2 of 9)
Conclusions
Pretty good coverage
Basing results on full 38 items is even better
Could help to survey non-engineers to allow comparisons

IE is the most statistically-inclined engineer, so that's the best audience if you are limited
Concept Inventories
Remember where we came from!
What's the reference point?
How does the SCI compare to other concept inventories?

Especially others in engineering
Process
Process
From the author of a physics test (not FCI)
Generally it's pretty good but obviously a simplification
Many activities occur simultaneously
Also I think you need to acknowledge that you enhance your validity, reliability, etc. as you feedback
Sample Size
Compare SCI to other engineering concept inventories
Uncertainty: unpublished results
Statics is way ahead
Speaks of generalizability of results
We are in good shape
[Figure: sample size for Statics and Statistics, y-axis from 0 to 3000, x-axis from 1 to 4]
Scores and Reliability

Scores are low, but this is common in early-phase inventories
Higher scores typically found when teaching methods are assessed
Reliability in a similar range to other inventories

Between 0.70 and 0.80
Statistics seems more difficult to assess in one test (cf. factor analysis)
[Figure: post-test mean (left axis, 0 to 100) and alpha (right axis, 0.00 to 1.00) by semester]
Semesters and sample sizes: Su 2003 (n = 103), Fa 2003 (n = 280), Sp 2004 (n = 94), Su 2004 (n = 16), Fa 2004 (n = 163), Sp 2005 (n = 260), Su 2005 (n = 60), Fa 2005 (n = 429)
Post-test labels on the plot: 49.2, 45.5, 49.7, 49.6, 50.5, 52.3, 46.3, 45.7
Alpha labels on the plot: 0.75, 0.77, 0.72, 0.67, 0.67, 0.69, 0.70, 0.74
CI Suggestions
Develop a sequence for related inventories
FCI / MBT
Statics
Dynamics / Strength of Materials
Could Statistics fit with others?
Not currently.
Discuss who uses concept inventories

Colleagues? Friends? Outsiders?
Speaks of instructor and thus student motivation
Analysis Techniques
Simple: discriminatory index, percent correct, correlations
Got it!
Advanced: factor analysis, SEM, IRT

Got it!
Doesn't appear that anyone else has everything, although others have parts
Other Results
Andreas' IRT (dissertation in Mathematics)
Analyze response probability by ability level, for each response
Could this be integrated with confidence?
Pedagogical implications
Contributions
The SCI is an original creation
Part of the larger concept inventory scheme
Draws on and allows comparisons to literature on statistics and probability reasoning
Analysis and synthesis of the creation process itself
Insights into test reliability and validity
Publications
General development (Book One)
FIE 2003, ASEE 2004 conferences
Reliability (Chapter 6)
Under revision for JEE
Online test (Chapter 7)

Will be acknowledged as a data source in all future publications
Confidence (Chapter 8)
FIE 2006 (draft paper accepted pending revision)
There's much, much more to pull from here, possibly incorporating interviews
Factor analysis and interviews / survey

Certainly offer proposals for future research. Not sure if publishable at present.
Concept Inventories (Chapter 11)

JEE?
Summary paper (Chapter 12)

JEE?
Process
The structure of the dissertation reflects the prevailing methods and conclusions used in constructing, analyzing, and adapting the SCI.
There is meaning in this structure.
We couldn't have created the SCI without some background in test theory, cognitive research, etc.
But, very important: the SCI also allowed us an avenue for further exploration.
Chicken? or Egg?
The methods evolved along with the instrument.
Criticisms
Lacks focus
I wanted to do EVERYTHING!!
Final chapters are open-ended, more like proposals than finished products
Phase II NSF grant?
Plus, that's life.
No formal hypothesis
Question: Can you design a test to assess statistics concepts?
Hypothesis: Yes, I can!
Conclusion: Here's how I did it.
From the beginning.

Increase input and participation across departments and universities
Improving!
More lit review (Kahneman & Tversky, Pollatsek, Piaget, etc.)

Got it now!
Participation hindered by not teaching Intro Stats

Ditto!
But: Does this introduce bias?
The Future
What is being taught? And how?
Instructor surveys (easy)
Classroom observation (difficult)
Integrate confidence ratings with IRT
Interviews / focus groups more often
New items
How long has it been?
