Dr. Geoffrey J. Gordon: What can machine learning do for open education?

WHAT CAN MACHINE LEARNING DO FOR OPEN EDUCATION?
Geoff Gordon
CMU Machine Learning
ggordon@cs.cmu.edu

Civilization advances by
extending the number of
important operations which
we can perform without
thinking about them.
—Alfred North Whitehead, 1911

Geoff Gordon—OCWC—April 2014
CONTRIBUTION OF ML
Machine learning can help us understand how students learn
‣ Not just any ML, but latent variable (“hidden feature”) discovery
‣ Not just any latent variables, but highly structured ones

WHY BOTHER?
Student feedback
‣ what does the student know?
‣ what are common causes for the mistake the student just made?
Instructor feedback
‣ what do the students know?
‣ what skills does this course content address?
‣ what skills doesn’t this course content address?
Evaluation
‣ help design rubrics for (peer, instructor) grading
‣ cluster submissions by similar approach, skill level, …
Etc…

GOAL: UNDERSTAND HOW STUDENTS
LEARN SOMETHING
[Figure: geometry problem with side lengths 10m, 4m, 5m, 10m] → 70 m²
3x + 4 = x + 10 → x = 3
John took Joe for a ride on his boat. ___ boat was blue with a red stripe. (A/The/[]) → The

EX: GEOMETRY TUTOR
http://www.carnegielearning.com/

STEP-LEVEL DATA
Record right/wrong, timing, use of hints, …
one step = fill in a box

SIMPLEST MODEL: RASCH /
1-PARAMETER ITEM RESPONSE THEORY
$$\ln\left(\frac{p_t}{1 - p_t}\right) = \theta_{i_t} + \beta_{j_t}$$
θ = student mean (knowledge level)
β = item mean (easy/difficult)
i_t = student ID, j_t = step ID
p_t = P(correct answer)
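
The Rasch relation on this slide (log-odds of a correct answer = student parameter + step parameter) can be inverted to get a probability. The following is a minimal sketch, assuming the slide's additive sign convention (a larger β means an easier step); the function name and the numeric values are hypothetical.

```python
import math

def rasch_probability(theta: float, beta: float) -> float:
    """P(correct) when the log-odds equal theta + beta (logistic inverse)."""
    return 1.0 / (1.0 + math.exp(-(theta + beta)))

# Hypothetical values: a slightly above-average student on a hard step.
p = rasch_probability(theta=0.5, beta=-1.5)

# Converting back to log-odds recovers theta + beta.
log_odds = math.log(p / (1.0 - p))
print(round(p, 3), round(log_odds, 3))
```

With θ + β = 0 the model predicts a 50% chance of a correct answer, which is why the sign of θ + β can serve as a decision rule.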

SIMPLEST MODEL: RASCH /
1-PARAMETER ITEM RESPONSE THEORY
[Figure: students × steps matrix (rows: students, columns: steps in tutor). Each entry (1 or 0): does student i get step j right? One θ per student, one β per step.]
predict 1 if θi+βj > 0
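
The decision rule above can be applied to a whole student-by-step table at once. This is a small sketch with made-up parameter values (the arrays `theta` and `beta` are hypothetical, not fitted to any data); it only illustrates the thresholding, not how the parameters are learned.

```python
import numpy as np

theta = np.array([1.0, -0.5, 0.2])        # hypothetical student parameters
beta = np.array([0.3, -0.8, 0.0, -0.1])   # hypothetical step parameters

# Broadcasting builds the (students x steps) table of theta_i + beta_j.
scores = theta[:, None] + beta[None, :]

# Predict 1 (correct) wherever theta_i + beta_j > 0, else 0.
predictions = (scores > 0).astype(int)
print(predictions)
```

Because every row is the same β vector shifted by one θ, this model can only rank steps the same way for every student, which is the limitation the later slides on step similarity address.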

STRUCTURE: SIMILARITY AMONG STEPS
Learn a “step map”: each point = 1 step
Steps over here are more
similar to each other…
… than to steps over here
hic sunt dracones (“here be dragons”)

HOW PRINCIPAL COMPONENTS ANALYSIS
GOT FAMOUS
[Figure: ratings matrix with rows Y1, Y2, Y3, …, Yn (users) and one column per movie]
Each entry: how many stars does user i give to movie j?

RESULT OF FACTORING
[Figure: users × movies matrix factored into basis weights u1, u2, u3, …, un (one per user) and basis vectors v1, …, vk (over movies)]
Low-d basis = latent variables
Basis vectors represent latent properties of movies, e.g., “is a comedy”
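
One common way to obtain such a low-dimensional basis is PCA computed through a truncated SVD. The sketch below assumes a small, fully observed ratings matrix with invented values; real rating data has missing entries, which this example does not handle.

```python
import numpy as np

# Hypothetical ratings: 4 users x 3 movies (all entries observed).
Y = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

k = 2                                   # number of latent dimensions kept
mean = Y.mean(axis=0)                   # per-movie mean, removed before PCA
U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)

weights = U[:, :k] * s[:k]              # (users, k): basis weights per user
basis = Vt[:k, :]                       # (k, movies): basis vectors

Y_hat = weights @ basis + mean          # rank-k reconstruction of the ratings
print(np.round(Y_hat, 2))
```

Each row of `basis` plays the role of a latent movie property, and each row of `weights` says how strongly one user responds to those properties.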

IN OUR CASE (STUDENT-STEP DATA)
[Figure: students × steps matrix factored into basis weights U1, U2, U3, …, UN (one per student) and basis vectors V1, …, VK (over steps)]
Basis vectors are candidate “eigenskills”
Weights are students’ knowledge levels

DOES IT WORK?
[Figure: step map; legend: steps about pentagons, steps about circles, other steps]
Learned features let us predict held-out data better than chance (ρ = .3, p < 0.0001)
Yes, sort of …

STRUCTURE: PRACTICE MAKES PERFECT
PCA ignores step order — clearly wrong…
Add model of student learning to PCA
‣ based on “additive factor model” [Draney et al., 1995]
Result: predictions of held-out data get slightly better
‣ ρ = .45 (p < 0.01 vs. plain PCA)
Step map still looks the same
Meh…
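
For reference, the additive factor model as it is commonly written in the educational data mining literature sets the log-odds of a correct answer to the student's proficiency plus, for each skill the step uses, a skill easiness term and a learning-rate term scaled by prior practice. The sketch below assumes that common form; the exact variant combined with PCA in this talk is not given on the slide, and all names and numbers are hypothetical.

```python
import math

def afm_probability(theta, q_row, beta, gamma, practice):
    """P(correct) under a commonly cited additive factor model form.

    theta    : student proficiency
    q_row    : 0/1 flags, which skills this step uses
    beta     : per-skill easiness
    gamma    : per-skill learning rate
    practice : prior practice opportunities the student had on each skill
    """
    logit = theta + sum(
        q * (b + g * t) for q, b, g, t in zip(q_row, beta, gamma, practice)
    )
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical step that uses only the first of two skills.
q_row, beta, gamma = [1, 0], [-1.0, 0.5], [0.3, 0.1]
before = afm_probability(0.0, q_row, beta, gamma, practice=[0, 0])
after = afm_probability(0.0, q_row, beta, gamma, practice=[5, 0])
print(round(before, 3), round(after, 3))
```

Unlike plain PCA, the prediction now depends on step order through the practice counts, so the same step becomes easier after more opportunities.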

WHAT WE REALLY WANT
To be understandable to us humans, latents need to be sparse and
binary (‘is about circles’,‘requires subtracting areas’)
Can’t do this fully automatically from this small data set (only 59
students, 370 steps)
Challenge: can we discover sparse, binary, understandable latents
automatically from MOOC-scale data?

“KC HYPOTHESIS”
Knowledge comes in atomic units (“KCs”)
Each KC is learned independently (no transfer)
‣ transfer among steps mediated by common KCs
‣ or prerequisite structure (can’t learn algebra w/o knowing arithmetic)
Each student has a (latent, scalar) proficiency level for each KC
‣ learn/forget = transition to a higher/lower proficiency level
Learning a KC happens only through exposure to that KC
‣ problem, worked example, lecture, real life, …
Steps labeled by KCs: step 17: {A, B}; step 23: {A, C}
[Koedinger, Corbett, Perfetti. Cognitive Science, 2012]
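
The step labels on this slide (step 17 uses KCs A and B, step 23 uses A and C) can be stored as sets, which makes "transfer mediated by common KCs" a set intersection. This is an illustrative sketch; the helper names are hypothetical, and the 0/1 table it builds is what the literature usually calls a Q-matrix.

```python
# KC labels taken from the slide: step number -> set of KCs it exercises.
step_kcs = {17: {"A", "B"}, 23: {"A", "C"}}

def shared_kcs(step_a, step_b, labels):
    """KCs through which practice on one step could transfer to the other."""
    return labels[step_a] & labels[step_b]

def q_matrix(labels):
    """Return (steps, kcs, rows) where rows[i][j] = 1 if step i uses KC j."""
    kcs = sorted(set().union(*labels.values()))
    steps = sorted(labels)
    rows = [[int(kc in labels[s]) for kc in kcs] for s in steps]
    return steps, kcs, rows

print(shared_kcs(17, 23, step_kcs))
print(q_matrix(step_kcs))
```

Under the KC hypothesis, practice on step 17 helps on step 23 only through the shared KC A.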

CONSEQUENCES OF KC HYPOTHESIS
Mistakes are at KC level: select wrong KC; apply right KC to wrong
data; mistake in application of KC
‣ identifying the KC at fault makes it easier to give student feedback
If we can accurately
‣ determine list of KCs
‣ label instructional activities by KCs
…then we immediately know the quality/coverage of our content

COMPOSE-BY-ADDITION
[Stamper & Koedinger, AIED 2011]

WHY ARE SOME COMPOSE-BY-ADDITION
STEPS HARDER?
[Figure: compose-by-addition steps, labeled hard, medium, and easy]

HYPOTHESIS: DIFFERENCE IS IN HOW
MUCH PLANNING IS NEEDED
[Figure labels: plan to compose; subtract; compose by addition]

KC DISCOVERY
[Paper excerpt, cropped at the left margin:] “[…] Other problems were “unscaffolded” and did not start with such […] Thus students had to pose these subgoals themselves. Indeed the blips for compose-by-addition (seen in the learning curve in Figure 2) do correspond with a […]ency of these unscaffolded problems.”
[Stamper & Koedinger, AIED 2011]

USE DATA-DRIVEN MODEL TO
REDESIGN TUTOR
New skill bars for planning skills
‣ skill bars are a tutor interface to
show students where they are in
acquiring skills
Sequence for gentle slope
‣ adaptive fading of scaffolding
New problems that focus on planning
‣ next slide…
[Figure: skill bars. Original: Combine areas; Enter given values; Find regular area. Redesigned: Plan to combine areas; Combine areas; Subtract; Enter given values; Find regular area.]

NEW PROBLEMS: ISOLATE PRACTICE ON
PLANNING STEP
Decompose complex problem into simpler ones

RESULTS
More efficient: 25% less student time
‣ instructional time by step type
Better learning of planning skills
‣ post-test %correct by item type
[Figure: Fig. 4 from Koedinger et al., p. 428, panels (a) and (b); axis labels: time (minutes), % correct]
[Stamper & Koedinger, AIED 2011]

MORE STRUCTURE: WHAT’S IN A KC?
So far, each KC is just present or absent in a student or problem
Nothing to distinguish algebra KCs from ESL KCs
What’s going on under the hood as a student solves a problem?

RULE-BASED COGNITIVE MODEL
3(2x – 5) = 9
IF GOAL IS SOLVE A(BX+C) = D
THEN REWRITE AS ABX + AC = D   [KC]   → 6x – 15 = 9
THEN REWRITE AS ABX + C = D    [bug]  → 6x – 5 = 9
THEN REWRITE AS BX+C = D/A     [KC]   → 2x – 5 = 3
What does it look like inside the student’s brain?
‣ … maybe a rule-based system
‣ … in which case KCs might correspond to rules
‣ :- president of US is Obama
‣ constant C on LHS of equation E :- move C to RHS of E
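
The three rewrite rules on this slide can be written as small functions over the pattern A(BX + C) = D and applied to 3(2x – 5) = 9. This is an illustrative sketch, not the tutor's actual rule engine; the function names and the tuple encoding are hypothetical.

```python
from fractions import Fraction

# An equation A(BX + C) = D is encoded as the tuple (a, b, c, d).
# Each rule returns (coef, const, rhs), meaning coef*x + const = rhs.

def distribute(a, b, c, d):
    """Correct KC: A(BX + C) = D  ->  ABX + AC = D."""
    return (a * b, a * c, d)

def distribute_bug(a, b, c, d):
    """Buggy rule: multiplies only the first term, ABX + C = D."""
    return (a * b, c, d)

def divide_both_sides(a, b, c, d):
    """Correct KC: A(BX + C) = D  ->  BX + C = D/A."""
    return (b, c, Fraction(d, a))

def show(coef, const, rhs):
    sign = "-" if const < 0 else "+"
    return f"{coef}x {sign} {abs(const)} = {rhs}"

RULES = {
    "distribute": distribute,
    "distribute_bug": distribute_bug,
    "divide_both_sides": divide_both_sides,
}

problem = (3, 2, -5, 9)  # 3(2x - 5) = 9
results = {name: show(*rule(*problem)) for name, rule in RULES.items()}
for name, equation in results.items():
    print(f"{name}: {equation}")
```

Seeing which of the three outputs a student wrote identifies which rule fired, which is what makes KC-level feedback possible.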

RULE-BASED SYSTEM
Aka production system:
‣ declarative knowledge held in working memory
‣ production rules match declarative knowledge
‣ and act on WM or external world
Much cognitive modeling work endorses this claim explicitly or implicitly
‣ ACT-R, SimStudent, Russell & Norvig, …
But two problems: uncertainty handling, representation learning
‣ here’s where more ML research can help!
“I see 3x+5 = 8”
“if LHS has constant C…”
“… then subtract C from both sides”

PROBLEM 1: UNCERTAINTY
A DAY IN THE LIFE OF A RAT
Trial   Bell?   Light?   Food?
1       ×       ✓        ✓
2       ✓       ×        ×
3       ×       ✓        ×
4       ×       ✓        ✓
…       …       …        …

RAT AS BAYESIAN
Priors over: how many trial types, sparsity of connections, reliability of
connections, …
(This is a common architecture for medical diagnosis systems)
[Figure: network connecting latent trial types (1, 2, …) to observables (bell, light, food, …)]
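
As a much simpler stand-in for the latent-trial-type model on this slide, the sketch below shows the basic Bayesian step of combining a prior with observed trials. It uses a Beta-Bernoulli estimate of P(food | cue) over the four trials listed in the rat's trial table; the uniform Beta(1, 1) prior and the function name are assumptions, and this is not the model from the talk.

```python
# The four listed trials (bell?, light?, food?) from the trial table.
trials = [
    {"bell": False, "light": True, "food": True},
    {"bell": True, "light": False, "food": False},
    {"bell": False, "light": True, "food": False},
    {"bell": False, "light": True, "food": True},
]

def posterior_mean_food(trials, cue, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(food | cue present) under a Beta(prior_a, prior_b) prior."""
    with_cue = [t for t in trials if t[cue]]
    successes = sum(t["food"] for t in with_cue)
    failures = len(with_cue) - successes
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

p_light = posterior_mean_food(trials, "light")
p_bell = posterior_mean_food(trials, "bell")
print(p_light, p_bell)
```

This treats each cue independently, so it cannot express "bell plus light means no food"; capturing that requires the structured latent trial types the slide describes.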

QUIZ: ARE YOU SMARTER THAN A RAT?
Trial   Bell?   Light?   Food?
1       ×       ✓        ✓
2       ×       ✓        ×
3       ×       ✓        ✓
4       ×       ✓        ✓
…       …       …        …
100     ×       ✓        ✓
101     ✓       ✓        ×

AND THE RAT SAYS…
Both right! With more light-bell trials, evidence increases for a
separate trial type.
Effect name                 2nd-order conditioning   Conditioned inhibition
light-food trials           many                     many
bell-light trials           few                      many
test: bell predicts food?   ↑                        ↓

BAYESIAN RULE LEARNING IN CLASSICAL
CONDITIONING
Only fully Bayesian inference/learning captured both effects
[Courville, Daw, Gordon, Touretzky, NIPS 2003]
Few bell-light trials, 1 trial type: (bell, light, food) all associated
More trials: (bell, light, no food) v. (light, food, no bell)
[Figure: (a) Second-order conditioning. x-axis: number of bell-light trials (original label: number of A−X trials), 0 to 60; y-axis: probability, 0 to 1; curves: P(food | light), P(food | bell) (original labels: P(US | A, D), P(US | X, D))]

PROBLEM 2: REPRESENTATION LEARNING
Flaw with “KC = rule”: many bugs come from weak features
REPRESENTATION LEARNING
Some student errors come from failure to correctly interpret
(internally represent) a problem
As student sees more and more examples like 3x + 5 = 8, gets better
and better “language model” to explain them (build internal
representation)
→ some KCs must correspond to features of the improved language model

EXPERIMENT
Present algebra examples to a machine learning system
As part of learning, induce a language model (an unsupervised
probabilistic context-free grammar) for algebra equations
Make output of language model (grammar nonterminals, e.g.,
SignedNumber) available as features of each example
Use these features in simulated problem-solving to discover KCs
[Li, Cohen, Koedinger, Matsuda, 2010]
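
To see why a grammar feature such as SignedNumber matters, compare two ways of reading the coefficient in -3x = 6. The sketch below uses hand-written regular expressions as a stand-in for learned grammar nonterminals; it is not the induced grammar from the cited experiment, and the function names are hypothetical.

```python
import re
from fractions import Fraction

def weak_coefficient(term: str) -> int:
    """Weak feature: the digits before x, ignoring any leading sign."""
    return int(re.search(r"(\d+)x", term).group(1))

def signed_coefficient(term: str) -> int:
    """Stronger feature: a signed number (optional minus, then digits) before x."""
    return int(re.search(r"(-?\d+)x", term).group(1))

# Solving -3x = 6 by dividing the right-hand side by the perceived coefficient.
lhs, rhs = "-3x", 6
wrong_answer = Fraction(rhs, weak_coefficient(lhs))
right_answer = Fraction(rhs, signed_coefficient(lhs))
print(wrong_answer, right_answer)
```

The divide-by-coefficient rule is identical in both cases; only the feature feeding it differs, which is the sense in which a bug can come from a weak representation and not from a wrong rule.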

NEW COGNITIVE MODELS ARE
MORE ACCURATE
[Li, Cohen, Koedinger, Matsuda, 2010]

OPEN RESEARCH QUESTION
Can we build a new generation of rule-based system that has
‣ rich uncertainty handling
‣ integrated representation learning
… and use it to help us model student learning?

SUMMARY
A key contribution of machine learning to education will be to help
understand the educational content we’re creating and delivering
Essential idea: ML models of structured latent variables
Specifically, build and test hypotheses about the knowledge,
procedures and representations students use to solve problems
‣ latents = KCs, rules, representations, strategies, …
Need to link uncertainty handling (traditional domain of ML) to new,
harder situations encountered in understanding student knowledge
Exciting time for research in ML and education!
