
The Statistics Concept Inventory: The Development and Analysis of a Cognitive Assessment Instrument in Statistics

Dissertation Defense

Kirk Allen

May 2, 2006

Organization
Book One
Creation of the SCI

Book Two
Expanding, Doing more with the data

Book Three
Re-validating

Book Four
Summarize, Speculation on the future

My personal timeline
Fall 2002: started grad school
Summer/Fall 2004: decided to go straight for Ph.D.
Fall 2005: General exams
Spring 2006: Taking my final class
Spring 2006: Graduating!

Background
Statistics Concept Inventory (SCI) project began in Fall 2002
Based on the format of the Force Concept Inventory (FCI)
Shifts focus away from problem solving, which is the typical classroom format
Focus on conceptual understanding
Multiple choice, around 30 items

Force Concept Inventory


Focuses on Newton's three laws and related concepts
Scores and gains on initial testing much lower than expected
Led to evaluating teaching styles
Interactive engagement found to be most effective at increasing student understanding

Other Concept Inventories


Many engineering disciplines are developing concept inventories
e.g., thermodynamics, circuits, materials, dynamics, statics, systems & signals
Foundation Coalition
http://www.foundationcoalition.org/home/keycomponents/concept/index.html

Book One
The process of creating the SCI
I would have defended this as my Master's thesis

Traditional (approximately), five-chapter format


1. Introduction (short)
2. Test Theory
The methods that were used in creating the SCI

3. Concept inventories
Descriptions of other work along similar lines

4. Methods and Results


Combined because Methods is short

5. Preliminary conclusions (short)

Results Spring 2005


Course     Level            Mean, pre   Mean, post   SD, post
Quality    Junior IE        44.9%       --           13.4% (pre)
Engr       Intro, calc      40.7%       44.9%        14.6%
Math #1    Intro, calc      46.8%       44.0%        14.4%
Math #2    Intro, calc      48.5%       45.6%        14.4%
External   Intro, calc      --          49.8%        13.8%
Psych      Intro, algebra   38.6%       43.9%        10.9%

Reliability (Spring 2005)


Course    Pre-Test Alpha   Post-Test Alpha
Quality   0.7084           --
Engr      0.6619           0.7744
Math #1   0.6071           0.7676
Math #2   0.7640           0.7079
Psych     0.4284           0.5918

Content Validity
Content validity refers to the extent to which items are (1) representative of the knowledge base being tested and (2) constructed in a sensible manner (Nunnally)
Focus groups ensure that the question is being properly interpreted and help develop useful distracters

Content Validity
Faculty survey: statistics topics were rated for their importance to the faculty
Helps provide a list of which topics to include on the SCI

AP Statistics course outline also consulted for topic coverage

Gibbs criteria identify poorly written questions

Concurrent Validity
For Spring 2004
Three courses: 1 Engr, 2 Math

Course           SCI Pre                 SCI Post                  SCI Gain                SCI Norm. Gain
Engr (n=29)      r = 0.060 (p = 0.758)   r = 0.133 (p = 0.493)     r = 0.080 (p = 0.679)   r = 0.108 (p = 0.578)
Math #1 (n=30)   r = 0.323 (p = 0.081)   r = 0.502** (p = 0.005)   r = 0.316 (p = 0.089)   r = 0.353 (p = 0.056)
Math #2 (n=26)   r = 0.219 (p = 0.282)   r = 0.384 (p = 0.053)     r = 0.303 (p = 0.133)   r = 0.336 (p = 0.094)
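
The slides do not define "Norm. Gain". Assuming it follows the normalized gain commonly reported alongside the FCI (actual gain divided by the maximum possible gain), a minimal sketch looks like this. The function name and the example scores are illustrative, not taken from the SCI data.

```python
def normalized_gain(pre_pct, post_pct):
    """Assumed definition: (post - pre) / (100 - pre), scores in percent."""
    if pre_pct >= 100:
        raise ValueError("normalized gain is undefined for a perfect pre-test")
    return (post_pct - pre_pct) / (100 - pre_pct)


# Illustrative scores only: a raw gain of 6 points from a pre-test of 40%
# uses up one tenth of the 60 points that were available to gain.
example = normalized_gain(40.0, 46.0)
print(round(example, 3))
```

Under this assumed definition, the same raw gain counts for more when the pre-test score is already high, which is why gain and normalized gain can correlate differently with an outside measure.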

Construct Validity
Three-factor and four-factor FIML with general factor
Descriptive, inferential, probability, and graphical sub-tests
Graphical a priori grouped with Descriptive in 3-factor Confirmatory Model
Overall results: Item Uniqueness 70.1% and 70.4%
Preference is for four-factor model because graphical items are a separate sub-test
More on this later!

Item Discrimination Index


Compares top quartile to bottom quartile on each item
Generally around 1/3 of the items fall into each of the ranges: poor (< 0.20), moderate (0.20 to 0.40), and high (> 0.40)
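
A minimal sketch of the discrimination index as described above: proportion correct in the top quartile minus proportion correct in the bottom quartile, with quartiles formed on total score. The quartile size rule (`n // 4`), the function names, and the data are assumptions for illustration.

```python
def discrimination_index(item_correct, total_scores):
    """Top-quartile minus bottom-quartile proportion correct for one item.

    item_correct: 0/1 values, one per student, for the item.
    total_scores: total test scores in the same student order.
    """
    n = len(total_scores)
    k = max(1, n // 4)  # students per quartile group (assumed rounding rule)
    order = sorted(range(n), key=lambda i: total_scores[i])
    bottom, top = order[:k], order[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom


def classify(d):
    """Label an index using the ranges quoted on the slide."""
    if d < 0.20:
        return "poor"
    if d <= 0.40:
        return "moderate"
    return "high"


# Hypothetical data: 8 students, one item.
scores = [5, 8, 12, 15, 18, 22, 27, 30]
item = [0, 0, 1, 0, 1, 0, 1, 1]
d = discrimination_index(item, scores)
print(d, classify(d))
```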

Item Analysis
Discrimination index
Alpha-if-deleted
Reported by SPSS or SAS
Shows how overall alpha would change if that one item were deleted

Answer distribution
Try to eliminate or improve choices which are consistently not chosen

Focus group comments
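
As an illustration of alpha-if-deleted, the sketch below computes Cronbach's alpha from a students-by-items matrix of 0/1 scores and then recomputes it with each item removed in turn. It uses the standard alpha formula; the function names and the six-student data set are hypothetical, and SPSS or SAS output may differ in presentation.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """rows: one list of 0/1 item scores per student."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(rows):
    """Alpha of the remaining items after dropping each item in turn."""
    k = len(rows[0])
    results = []
    for drop in range(k):
        reduced = [[v for j, v in enumerate(r) if j != drop] for r in rows]
        results.append(cronbach_alpha(reduced))
    return results


# Hypothetical responses: 6 students, 4 items; the last item tracks the
# total score less closely than the first three.
rows = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(rows), 4))
print([round(a, 4) for a in alpha_if_deleted(rows)])
```

An item whose deletion raises alpha above the full-test value is a candidate for revision or removal; in this toy data that is the fourth item.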

Understanding p-values
A researcher performs a t-test to test the following hypotheses: H :
0 0

H1 : 0

He rejects the null hypothesis and reports a p-value of 0.10. Which of the following must be correct?
a) The test statistic fell within the rejection region at the significance level
b) The power of the test statistic used was 90%
c) Assuming Ho is true, there is a 10% possibility that the observed value is due to chance **
d) The probability that the null hypothesis is not true is 0.10
e) The probability that the null hypothesis is actually true is 0.9

Results for 4 classes


Choice   Pre #1   Post #1      Pre #2   Post #2       Pre #3   Post #3       Pre #4   Post #4
a        15%      41%          32%      52%           5%       67%           17%      18%
b        16%      18%          14%      20%           14%      0%            6%       9%
c        41%      35% (-6%)    41%      15% (-24%)    62%      27% (-35%)    47%      42% (-5%)
d        18%      6%           14%      12%           19%      7%            19%      24%
e        2%       0%           0%       0%            0%       0%            11%      6%

Analysis
Discrimination
Pre: 0.25, -0.17, 0.52, 0.15
Post: 0.00, -0.14, 0.25, 0.33

P-value question
Problems?
Too definitional
p-value is taught from an interpretive standpoint
when to reject or not reject the null hypothesis

Therefore

New question
(not a replacement)

An engineer performs a hypothesis test and reports a p-value of 0.03. Based on a significance level of 0.05, what is the correct conclusion?
a) The null hypothesis is true.
b) The alternate hypothesis is true.
c) Do not reject the null hypothesis.
d) Reject the null hypothesis **

Results of New Question


Discrimination better (post-test)
0.20, 0.29, 0.75, 0.12
Still not great overall (0.19)

Percent correct and gains low


Post-test % correct (gain +/-%)
6% (-17%) 20% (-3%) 33% (+4%) 19% (+10%)

Moving on
Similar analyses were conducted for all items, a sort of bottom-up approach to developing the test
For right or wrong, the test has changed very little since Spring 2004
No need to continually repeat the item analysis tables with such fine detail
Let's see what else we can do with the SCI!

Exploring Reliability
Results (older, also presented during Proposal)
Demonstrated strong relationships between:
Alpha-if-deleted (a measure of item reliability)
Discrimination index
Gap: the average total score of students who answered an item correctly minus the average total score of students who answered it incorrectly
Gap = Mean(correct) - Mean(incorrect)

Focus groups tell us that guessing or using test-taking tricks are valid causal agents for poor item statistics
This meshes with theory because you would expect these items to have a Gap of zero, which lowers total-score variance
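
The Gap statistic defined above can be sketched directly from its definition. The function name and the four-student example are hypothetical.

```python
def gap(item_correct, total_scores):
    """Mean total score of students who got the item right, minus the
    mean total score of students who got it wrong."""
    right = [t for c, t in zip(item_correct, total_scores) if c == 1]
    wrong = [t for c, t in zip(item_correct, total_scores) if c == 0]
    if not right or not wrong:
        raise ValueError("Gap needs at least one correct and one incorrect answer")
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Hypothetical data: the two highest scorers answered the item correctly.
totals = [10, 20, 30, 40]
answers = [0, 0, 1, 1]
print(gap(answers, totals))

# An item answered by pure guessing would be expected to show a Gap near
# zero, because correct and incorrect responders have similar total scores.
```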

Online Test
From Proposal Defense
Comparing online vs. paper
Differences found
9 items
Sub-test (except Probability) and overall scores
Reliability: Probability and Inferential
Problem with study: nearly all paper students at one university, which had very good overall results
Differences are still not large

Other findings
Order effects
No systematic bias in question order
No correlation between percent correct and order position
Small (but significant) downward trend in answer confidence
Only about 5% of the total rating scale from beginning of test to the end

Round 2
Spring 2006, pre-test
Two sections of same course, taught by same professor, took SCI on the same day
One paper (n=14), one online (n=16)
Very similar demographics

Comparative results (next slide)
Interesting finding
Online: time and number correct inversely related
Paper: opposite
Not rigorously assessed

Table: measures compared online vs. paper, Fall 2005 and Spring 2006
Reliability: Probability, Inferential
Mean / Variance: Total, Descriptive, Inferential, Graphical, Probability
Items: #1 (Probability), #2 (Inferential), #3 (Descriptive), #7 (Graphical), #9 (Descriptive), #15 (Descriptive), #18 (Inferential), #19 (Inferential), #21 (Probability), #22 (Inferential), #28 (Graphical), #35 (Inferential), #36 (Inferential)

Problems
Fall 2005
Confounding with university

Spring 2006
Pre-test
Small sample size

The Problem with Educational Research

Statistical Power

Rigour / Control

Chapter 8
Part A: Lit review
Background on difficulties
Attitudes
Reasoning skills: Probability
Kahneman & Tversky

Reasoning skills: Statistics
Some teaching strategies

Chapter 8
Part B: Confidence on the SCI
Original?
The reviewed studies are generally very specific and in-depth on a certain topic
Or, they are very general as to why students have difficulties (e.g., attitudes)
Nothing which provides a broad comparison identifying conceptual difficulties across statistics
So use the SCI to do this.

Chapter 8
Method
After students answer each question for the online SCI, the following is presented to them.

Results: big picture

Results: sample item


Which would be more likely to have 70% boys born on a given day: A small rural hospital or a large urban hospital?
a) Rural
b) Urban
c) Equally likely
d) Both are extremely unlikely

Rank 10th in correct (low)
Rank 25th in confidence (high)
Students are over-confident
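
The sample-size effect behind this item can be checked with a binomial tail probability. The sketch below assumes "70% boys" means at least 70%, that each birth is a boy with probability 0.5, and uses hypothetical daily birth counts of 10 (small hospital) and 100 (large hospital); none of these numbers come from the SCI item.

```python
from math import ceil, comb


def prob_at_least_share(n_births, share=0.70, p_boy=0.5):
    """P(at least `share` of n_births are boys) under a binomial model."""
    # Small tolerance guards against floating-point error in share * n.
    k_min = ceil(share * n_births - 1e-9)
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy) ** (n_births - k)
        for k in range(k_min, n_births + 1)
    )


small = prob_at_least_share(10)   # hypothetical small rural hospital
large = prob_at_least_share(100)  # hypothetical large urban hospital
print(small, large)
```

Under these assumptions the extreme proportion is far more likely at the small hospital, which is the point the "equally likely" distractor misses.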

Results: the graphs

Results: comparison
Kahneman & Tversky studied a very similar problem as part of the representativeness misconception of probability
Subjects do not appreciate that large samples are more likely to be representative of the population
20% correct, 56% equally likely
Smaller N of subjects, also inexperienced

SCI online: 37% correct, 45% equally likely

Chapter Nine
Questions
Is the SCI a multi-dimensional test?
How does this compare to such a broad field as statistics?
What implications can be drawn from the answers to the above?

Reliability
Common measure, Cronbach's alpha, is an under-estimate of reliability for a multidimensional test
Other measures account for this
Theta: based on largest eigenvalue from a principal component analysis
Omega: based on communalities from a factor analysis, thus depends on the number of factors
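
A minimal sketch of theta, assuming it refers to Armor's theta, theta = (p / (p - 1)) * (1 - 1 / lambda_1), where p is the number of items and lambda_1 is the largest eigenvalue of the item correlation matrix. The function name and the 4-item correlation matrix are hypothetical.

```python
import numpy as np


def armor_theta(corr):
    """Theta reliability from the largest eigenvalue of a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    largest = np.linalg.eigvalsh(corr)[-1]  # eigvalsh returns ascending order
    return (p / (p - 1)) * (1 - 1 / largest)


# Hypothetical 4-item matrix with every inter-item correlation equal to 0.5.
# Its largest eigenvalue is 1 + 3 * 0.5 = 2.5.
corr = np.full((4, 4), 0.5)
np.fill_diagonal(corr, 1.0)
print(armor_theta(corr))
```

If this formula is the one used, a theta of 0.8123 on 38 items would correspond to a largest eigenvalue of about 4.78.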

Figure: reliability estimates (y-axis 0.75 to 1.00, x-axis 1 to 38) showing omega against reference values theta = 0.8123 and alpha = 0.7650; labeled points 0.798, 0.8081, 0.8446, 0.8907

Results indicate multi-dimensionality
But is it meaningful?

Exploratory Factor Analysis (EFA)


Many decisions to be made
Extraction method
Number of factors
Factor loadings
Rotation method

Simple structure
Each variable ideally loads along exactly one factor
Minimize number of variables per factor

Not the best paradigm


Other concept inventories have done it
Curiosity (I wonder what all those options in SPSS mean.)

Decisions
Extraction method
Principal components chosen
Not ideal: maximizes extracted variance
Does not optimize the prediction of the overall correlation structure

Quick comparison of PC vs. ML


First-factor loadings from a four-factor solution had a mean absolute-difference of 0.030 (small)

Decisions
Number of factors
Eigenvalues > 1
Fifteen factors

Scree plot (next slide)


One? Four? Nine?

Parallel analysis
Compare eigenvalues to random data
One or four
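
A rough sketch of the two retention rules mentioned here: the eigenvalue-greater-than-one rule and parallel analysis against random data. It assumes Pearson correlations, normally distributed random comparison data, and a 95th-percentile cutoff; the function name, parameters, and simulated data set are all hypothetical and are not the settings used for the SCI.

```python
import numpy as np


def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Return (factors kept by parallel analysis, factors with eigenvalue > 1)."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    # Observed eigenvalues of the correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    # Eigenvalues from uncorrelated random data of the same shape.
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        sims[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(sims, percentile, axis=0)

    # Keep leading factors whose eigenvalue beats the random-data cutoff.
    keep = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        keep += 1
    kaiser = int(np.sum(observed > 1.0))
    return keep, kaiser


# Hypothetical data: 300 respondents, 10 continuous items, one common factor.
rng = np.random.default_rng(1)
g = rng.standard_normal((300, 1))
data = 0.7 * g + 0.7 * rng.standard_normal((300, 10))
n_keep, n_kaiser = parallel_analysis(data)
print(n_keep, n_kaiser)
```

With real 0/1 item data and many items, the eigenvalue-greater-than-one rule tends to keep many more factors than parallel analysis, which matches the contrast on these slides (fifteen versus one or four).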

Figure: scree plot (x-axis 1 to 38)

Decisions
Assigning items to factors
How large does the factor loading need to be for the variable to be assigned to a given factor?
Investigated 0.1 to 0.5
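
One way to compare cutoffs is to count how many items meet the simple-structure ideal, that is, have exactly one loading at or above the cutoff. This is a sketch of that idea only; the counting rule, function name, and loading matrix are assumptions and not necessarily the criterion plotted in the dissertation.

```python
def items_on_one_factor(loadings, cutoff):
    """Count items with exactly one absolute loading at or above the cutoff."""
    count = 0
    for row in loadings:
        salient = [x for x in row if abs(x) >= cutoff]
        if len(salient) == 1:
            count += 1
    return count


# Hypothetical loadings: 4 items on 2 factors.
loadings = [
    [0.60, 0.12],   # loads cleanly on factor 1
    [0.35, 0.32],   # cross-loads at moderate cutoffs
    [0.05, 0.08],   # loads on nothing
    [0.20, -0.45],  # loads on factor 2 (sign ignored)
]
for cutoff in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(cutoff, items_on_one_factor(loadings, cutoff))
```

A very low cutoff assigns items to several factors at once, and a very high cutoff leaves items unassigned, so the count usually peaks somewhere in between.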

Figure: item counts (y-axis 0 to 35) by factor-loading cutoff (0.1, 0.2, 0.3, 0.4, 0.5)

Decisions
Rotation
Orthogonal
Next slide: five-factor solution

Oblique
Involves extra parameters depending on method
Following slide: promax rotation, with parameter Kappa

Figure: item counts (y-axis 0 to 35) by rotation (Unrotated, Equamax, Quartimax, Varimax)

Figure: item counts (y-axis 0 to 35) by promax Kappa (2, 3, 4, 5, 6, 8)

Decisions
Number of factors
Four (unrotated) and five (rotated) best approximate simple structure
One-dimensional structure is most likely, based on scree plot and parallel analysis

Factor loadings
Values around 0.32 best; use 0.30 for simplicity

Rotation
Unrotated: Varimax
Rotated: Promax with Kappa = 3

Conclusions
Items generally do not group in a meaningful way
But, some pairs of highly similar items grouped along the same factors
What now?

Confirmatory Factor Analysis (CFA)


Presumes the analyst has a pre-conceived notion of the underlying structure
It's probably not wise to write a test to assess a domain that you don't have a map of.

Decisions are made a priori, with model comparisons more formal

Models
1. Uni-dimensional
2. Capture finer clustering of similar items
3. Prior work concluded general factor plus four sub-topics

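Figure: path diagram with a single general factor, Statistics (G), loading on items Q1 to Q38 (weights w1 to w38), each item with its own error term (e1 to e38)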

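Figure: path diagram with general factor Statistics (G) loading on items Q1 to Q38 (weights w1 to w38, errors e1 to e38), plus a cluster factor f1 (Test statistics) loading on Q2 and Q36 (weights s2, s36)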

Model Fit
Overall fit (chi-square)
Ho: model fits
Function of sample size
Nearly always reject the null in practice

Fit indices
Alternate way to assess fit
Too many to name

Results
Model   Chi-sq   d.f.   (p)      GFI      PGFI
(1)     785      665    0.0009   0.8805   0.8329
(2)     771      659    0.0017   0.8828   0.8275
(3)     682      617    0.0355   0.8952   0.7857
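
The reported PGFI values are numerically consistent with PGFI = (d.f. of model / d.f. of independence model) x GFI, where the independence model for 38 items has 38 x 37 / 2 = 703 degrees of freedom. That formula is inferred from the numbers, not stated on the slides; the sketch below checks it against the three reported rows.

```python
N_ITEMS = 38
DF_INDEPENDENCE = N_ITEMS * (N_ITEMS - 1) // 2  # 703, assumed baseline d.f.

# model number: (d.f., GFI, reported PGFI), copied from the results slide
models = {
    1: (665, 0.8805, 0.8329),
    2: (659, 0.8828, 0.8275),
    3: (617, 0.8952, 0.7857),
}


def pgfi(df_model, gfi, df_independence=DF_INDEPENDENCE):
    """Parsimony-adjusted GFI under the assumed formula."""
    return df_model / df_independence * gfi


for number, (df_model, gfi, reported) in models.items():
    print(number, round(pgfi(df_model, gfi), 4), reported)
```

The adjustment shows the trade-off on the conclusions slide: model (3) has the highest GFI but spends the most degrees of freedom, so its PGFI is the lowest.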

Conclusions
One-factor (1) is most parsimonious
(2) not appreciably worse
Model is too sparse based on current SCI

(3) provides best overall fit, but with noticeable loss of parsimony
Different data
Different methods
I'm not throwing it out; still there for those so inclined

Problems
Used regular correlation instead of tetrachoric
Normality violated
Sample size too small
Literature indicates these problems will affect the absolute magnitude of estimates but not the relative magnitudes
So it's OK for what I did.

Proposal
Figure: proposed model diagram with node labels Statistics, Test Statistics, Standard Deviation, Correlation, Confidence Intervals, p value

Proposal
Could resemble original four proposed topic areas
You're gonna need a bigger boat.

Reliability Revisited
Based on the preferred uni-dimensional model, shortening the current SCI seems reasonable
Use objective criteria
Discriminatory index
Alpha-if-deleted
Communalities
Strong correspondence between metrics

Reliability Revisited
By alpha-if-deleted, 23 items is optimal length
Selected 25 as my preferred length due to correspondence between metrics and because it's a nice round number
Cross-validation indicates a shorter SCI maintains the overall reliability
Full: 0.7650
Cut: 0.7655 (simulated based on 23-item SCI)

Chapter 10
Re-assess Content Validity
Interviews
Faculty survey

Interviews
Prefer interviews over focus groups because SCI does not involve group decision-making
Informal approach
IE grad students
Experienced statistics students
Not hand-cuffed by pre/post timetable

Interviews
Sample item (retained)
Text is OK, as opposed to symbols
Un-anticipated approach
B is the more conservative test

Incorrect reasoning
D is what is good for the company
A bottling company believes a machine is under-filling 20-ounce bottles. What will be the alternate hypothesis to test this belief?
a) On average, the bottles are being filled to 20 ounces.
b) On average, the bottles are not being filled to 20 ounces.
c) On average, the bottles are being filled with more than 20 ounces.
d) On average, the bottles are being filled with less than 20 ounces.

A coin of unknown origin is flipped twelve times in a row, each time landing with heads up. What is the most likely outcome if the coin is flipped a thirteenth time?
a) Tails, because even though for each flip heads and tails are equally likely, since there have been twelve heads, tails is slightly more likely
b) Heads, because this coin has a pattern of landing heads up
c) Tails, because in any sequence of tosses, there should be about the same number of heads and tails
d) Heads and tails are equally likely

Interviews
Deleted item
Context of coins seems ingrained
50/50 always; fixated
But probability has to be different
Still answered D though!

Consideration of control!
But erred for gambler's fallacy

Recommend new context for this item

Faculty Survey
Rated the importance of 87 statistics topics on 1 to 4 scale
24 participants
IE faculty listserv and emailed SCI contacts
Not at OU

Compared with previous survey conducted at OU

Results
Generally strong correspondence between old and new surveys
Correlation: 0.69 (ranks), 0.67 (numbers)
Scales differ
New median 2.95, old median 2.61

Consider two surveys in tandem, using ranks

Based on 25 retained items

Results
16 topics ranked in Top 25 on both surveys
14 of these are covered
Very good!

9 topics in Top 25 of new but not old
Only 2 topics covered
Not so good

Exactly the same for old (2 of 9)

Conclusions
Pretty good coverage
Basing results on full 38 items is even better

Could help to survey non-engineers to allow comparisons


IE is the most statistically-inclined engineer, so that's the best audience if you are limited

Concept Inventories
Remember where we came from!
What's the reference point?

How does the SCI compare to other concept inventories?


Especially others in engineering

Process

Process
From the author of a physics test (not FCI)
Generally it's pretty good but obviously a simplification
Many activities occur simultaneously
Also I think you need to acknowledge that you enhance your validity, reliability, etc. as you feed back

Sample Size
Compare SCI to other engineering concept inventories
Uncertainty: unpublished results

Statics is way ahead
Speaks of generalizability of results
We are in good shape

Figure: sample size (y-axis 0 to 3000, x-axis 1 to 4) for the Statics and Statistics inventories

Scores and Reliability


Scores are low, but this is common in early-phase inventories
Higher scores typically found when teaching methods are assessed

Reliability in a similar range to other inventories
Between 0.70 and 0.80
Statistics seems more difficult to assess in one test (cf. factor analysis)

Figure: post-test mean (left axis, 0 to 100) and alpha (right axis, 0.00 to 1.00) by semester
Semesters: Su 2003 (n = 103), Fa 2003 (n = 280), Sp 2004 (n = 94), Su 2004 (n = 16), Fa 2004 (n = 163), Sp 2005 (n = 260), Su 2005 (n = 60), Fa 2005 (n = 429)
Post-test labels: 49.2, 45.5, 49.7, 49.6, 50.5, 52.3, 46.3, 45.7
Alpha labels: 0.75, 0.77, 0.72, 0.67, 0.67, 0.69, 0.70, 0.74

CI Suggestions
Develop a sequence for related inventories
FCI / MBT
Statics Dynamics / Strength of Materials
Could Statistics fit with others? Not currently.

Discuss who uses concept inventories


Colleagues? Friends? Outsiders?
Speaks of instructor and thus student motivation

Analysis Techniques
Simple: discriminatory index, percent correct, correlations
Got it!

Advanced: factor analysis, SEM, IRT


Got it!
Doesn't appear that anyone else has everything, although others have parts

Other Results
Andrea's IRT (dissertation in Mathematics)
Analyze response probability by ability level, for each response
Could this be integrated with confidence?
Pedagogical implications

Contributions
The SCI is an original creation
Part of the larger concept inventory scheme
Draws on and allows comparisons to literature on statistics and probability reasoning

Analysis and synthesis of the creation process itself
Insights into test reliability and validity

Publications
General development (Book One)
FIE 2003, ASEE 2004 conferences

Reliability (Chapter 6)
Under revision for JEE

Online test (Chapter 7)


Will be acknowledged as a data source in all future publications

Confidence (Chapter 8)
FIE 2006 (draft paper accepted pending revision)
There's much, much more to pull from here, possibly incorporating interviews

Factor analysis and interviews / survey


Certainly offer proposals for future research.
Not sure if publishable at present.

Concept Inventories (Chapter 11)


JEE?

Summary paper (Chapter 12)


JEE?

Process
The structure of the dissertation reflects the prevailing methods and conclusions used in constructing, analyzing, and adapting the SCI. There is meaning in this structure.
We couldn't have created the SCI without some background in test theory, cognitive research, etc.
But, very important: the SCI also allowed us an avenue for further exploration.
Chicken? or Egg?
The methods evolved along with the instrument.

Criticisms
Lacks focus
I wanted to do EVERYTHING!!

Final chapters are open-ended, more like proposals than finished products
Phase II NSF grant?
Plus, that's life.

No formal hypothesis
Question: Can you design a test to assess statistics concepts?
Hypothesis: Yes, I can!
Conclusion: Here's how I did it.

From the beginning.


Increase input and participation across departments and universities
Improving!

More lit review (Kahneman & Tversky, Pollatsek, Piaget, etc.)


Got it now!

Participation hindered by not teaching Intro Stats


Ditto! But: Does this introduce bias?

The Future
What is being taught? And how?
Instructor surveys (easy)
Classroom observation (difficult)
Integrate confidence ratings with IRT

Interviews / focus groups more often
New items
How long has it been??
