Your model is predictive, but is it useful?
Theoretical and Empirical Considerations of a
New Paradigm for Adaptive Tutoring Evaluation
José P. González-Brenes, Pearson
Yun Huang, University of Pittsburgh
1
Main point of the paper:
We are not evaluating student models
correctly
New paradigm, Leopard, may help
2
3
Why are we here?
4
5
Educational Data Mining = Data Mining?
6
Should we just publish at KDD*?
*or other data mining venue
7
Claim:
Educational Data Mining helps learners
8
9
Is our research helping learners?
Adaptive Intelligent Tutoring Systems:
Systems that teach and adapt content to
humans
(Diagram: the tutor teaches students and collects evidence from them)
10
Paper writing: Researchers quantify the
improvements of the systems compared
Not a purely academic pursuit:
Superintendents choose between alternative technologies
Teachers choose between systems
11
Randomized Controlled Trials may measure
the time students spent on tutoring, and
their performance on post-tests
12
Difficulties of Randomized Controlled Trials:
•  IRB approval
•  experimental design by an expert
•  recruiting (and often payment!) of enough
participants to achieve statistical power
•  data analysis
13
How do other A.I. disciplines do it?
14
Bleu [Papineni et al ‘01]:
machine translation systems
Rouge [Lin et al ’02]:
automatic summarization systems
Paradise [Walker et al ’99]:
spoken dialogue systems
15
Using automatic metrics can be very beneficial:
•  Cheaper experimentation
•  Faster comparisons
•  Competitions that accelerate progress
16
Automatic metrics do not replace RCTs
17
What does the Educational Data Mining
community do?
Evaluate the student model using
classification accuracy metrics like RMSE,
AUC of ROC, accuracy…
(Literature reviews by Pardos, Pelánek, … )
18
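For concreteness, the recipe on this slide usually looks like the sketch below (not from the paper): score a student model's held-out predictions with classification metrics. The data and variable names are illustrative.

```python
# Minimal sketch of the conventional evaluation recipe; data are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, mean_squared_error

y_true = np.array([1, 0, 1, 1, 0, 1])         # observed correct/incorrect responses
y_prob = np.array([.8, .4, .7, .9, .6, .55])  # student model's predicted P(correct)

rmse = np.sqrt(mean_squared_error(y_true, y_prob))
auc = roc_auc_score(y_true, y_prob)
acc = accuracy_score(y_true, y_prob >= 0.5)   # threshold probabilities at 0.5

print(f"RMSE={rmse:.3f}  AUC={auc:.3f}  accuracy={acc:.3f}")
```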
Other fields verify that automatic metrics correlate with the target behavior
[E.g.: Callison-Burch et al ’06]
19
Ironically, we have a growing body of
evidence that classification evaluation
metrics are a BAD way to evaluate adaptive
tutors
See the papers by Baker and Beck on the limitations and problems of these evaluation metrics
20
Surprisingly, in spite of all of the evidence
against using classification evaluation
metrics, their use is still very widespread in
the adaptive tutoring literature*
Can we do better?
* ExpOppNeed [Lee & Brunskill] is an exception
21
The rest of this talk:
•  Leopard Paradigm
– Teal
– White
•  Meta-evaluation
•  Discussion
22
Leopard
23
•  Leopard: Learner Effort-Outcomes
Paradigm
•  Leopard quantifies the effort and
outcomes of students in adaptive tutoring
24
•  Effort: Quantifies how much practice the adaptive tutor gives to students. E.g., number of items assigned to students, amount of time…
•  Outcome: Quantifies the performance of students after adaptive tutoring
(A hypothetical sketch of both quantities follows below)
25
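Here is that hypothetical sketch: the two quantities tabulated from an RCT-style log where the assigned practice and the post-tutoring performance are observed directly. The field names and the post-test notion are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of Leopard's two quantities for one student and one skill,
# assuming a log that records assigned items and post-tutoring responses.
def leopard_quantities(assigned_items, posttest_responses):
    effort = len(assigned_items)                                 # practice the tutor gave
    outcome = sum(posttest_responses) / len(posttest_responses)  # performance after tutoring
    return effort, outcome

# Bob received 6 items on a skill, then answered 3 of 4 post-test items correctly.
print(leopard_quantities(["i1", "i2", "i3", "i4", "i5", "i6"], [1, 1, 0, 1]))  # (6, 0.75)
```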
•  Measuring effort and outcomes is not novel by itself (e.g., RCTs)
•  Leopard’s contribution is measuring both without a randomized controlled trial
•  White and Teal are metrics that operationalize Leopard
26
White: Whole Intelligent Tutoring system
Evaluation
White performs a counterfactual simulation
(“What Would the Tutor Do?”) to estimate
how much practice students receive
27
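A minimal sketch of the counterfactual simulation, assuming the simulated tutor keeps assigning practice for a skill until the student model's predicted probability of correct crosses a mastery threshold: effort counts the opportunities the tutor would assign, and score is the student's actual performance over them. The threshold and the scoring rule are simplifying assumptions, not necessarily White's exact definitions.

```python
def white_effort_score(actual, predicted, threshold=0.95):
    """Counterfactual replay ("What Would the Tutor Do?") for one student and skill.

    actual:    observed responses (1 = correct, 0 = incorrect), in practice order
    predicted: the student model's P(correct) for each opportunity
    Returns (effort, score) under a hypothetical mastery-threshold stopping rule.
    """
    effort, correct = 0, 0
    for y_t, p_t in zip(actual, predicted):
        effort += 1
        correct += y_t
        if p_t >= threshold:   # the tutor believes the skill is mastered: stop practice
            break
    return effort, correct / effort
```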
28
Design desiderata:
The evaluation metric should be easy to use
Same, or similar, input as conventional metrics
29
(Worked example figure: students Bob and Alice practice skill s1; for each practice opportunity t, the figure lists the actual performance (correct/incorrect) of student u on skill q and the student model’s predicted performance.)
30
(Same worked example, now annotated with White’s outputs: from the actual and predicted performance, each student receives a score (fractions such as 2/3 and 4/5 in the figure) and an effort value.)
Alternatively, we can model error
Future direction: Present aggregate results
31
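Applying the white_effort_score sketch from the earlier note to made-up sequences (the numeric values in the original figure did not survive extraction, so these numbers are purely hypothetical):

```python
# Purely hypothetical sequences; not the values from the original figure.
bob_actual,   bob_predicted   = [0, 1, 1, 1, 0], [.5, .6, .7, .8, .9]
alice_actual, alice_predicted = [1, 1],          [.6, .8]

print(white_effort_score(bob_actual, bob_predicted, threshold=0.7))      # (3, 2/3): tutor stops after 3 items
print(white_effort_score(alice_actual, alice_predicted, threshold=0.7))  # (2, 1.0)
```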
Q: What if a student does not achieve the target performance?
A: “Visible” imputation
32
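The slide does not spell out what “visible” imputation does, so the variant below is only a clearly labeled stand-in: when the predicted probability never reaches the target, it imputes a fixed maximum effort instead of dropping the student, which keeps such students visible in the effort statistic. It assumes at least one observed opportunity per student.

```python
def white_with_imputation(actual, predicted, threshold=0.95, max_effort=20):
    """Counterfactual replay with a hypothetical imputation rule (not necessarily
    the paper's "visible" imputation): if the model never predicts mastery within
    the observed sequence, impute effort = max_effort rather than dropping the student."""
    effort, correct = 0, 0
    for y_t, p_t in zip(actual, predicted):
        effort += 1
        correct += y_t
        if p_t >= threshold:                # simulated tutor stops teaching
            return effort, correct / effort
    return max_effort, correct / effort     # target never reached: impute the effort
```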
Meta-Evaluation
33
Compare:
•  Conventional classification metrics
•  Leopard metrics (White)
34
Datasets:
•  Data from a commercial middle-school math tutor
–  1.2 million observations
–  25,000 students
–  Item-to-skill mapping:
•  Coarse: 27 skills
•  Fine: 90 skills
•  (Other item-to-skill model not reported)
•  Synthetic data
35
Assessing an evaluation metric with real
student data is difficult because we often do
not know the ground truth
Insight: Use data whose behavior in an adaptive tutor we know a priori
36
For adaptive tutoring to be able to optimize when to stop instruction, student performance should increase with repeated practice (the learning curve should be increasing)
Decreasing/flat learning curve = bad data
37
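A sketch of one way to flag skills whose learning curves are flat or decreasing; fitting a least-squares line to the per-opportunity success rate is an assumption about how "increasing" is operationalized, not the paper's stated procedure.

```python
import numpy as np

def learning_curve_slope(responses_by_opportunity):
    """responses_by_opportunity[t] = 0/1 responses at practice opportunity t, pooled
    over students. Returns the slope of a least-squares line through the success
    rates; a slope <= 0 suggests a flat or decreasing learning curve ("bad data")."""
    rates = [np.mean(r) for r in responses_by_opportunity if len(r) > 0]
    slope, _intercept = np.polyfit(np.arange(len(rates)), rates, deg=1)
    return slope

# Success rate drops with practice, so this (hypothetical) skill would be flagged.
print(learning_curve_slope([[1, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0]]))  # negative
```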
38
39
Procedure:
1.  Select skills with a decreasing/flat learning curve (i.e., bad data)
2.  Train a student model on those skills
3.  Compare classification metrics with Leopard
40
                     F1     AUC    Score   Effort
Bad student model    .79    .85
Majority class       0      .50
41
                     F1     AUC    Score   Effort
Bad student model    .79    .85    .18     10.1
Majority class       0      .50    .18     11.2
42
What does this mean?
•  High-accuracy models may not be useful for adaptive tutoring
•  We need to change how we report results
in adaptive tutoring
43
Solutions
•  Report classification accuracy averaged
over skills (for models with 1 skill per item)
✖ Not useful for comparing or discovering
different skill models
•  Report against a “difficulty” baseline
✖ Experiments suggest that models performing at the baseline level can still be useful
•  Use Leopard
44
Let’s use all data, and pick an item-to-skill
mapping:
                      AUC    Score   Effort
Coarse (27 skills)    .69
Fine (90 skills)      .74
45
Let’s use all data, and pick an item-to-skill
mapping:
                      AUC    Score   Effort
Coarse (27 skills)    .69    .41     55.7
Fine (90 skills)      .74    .36     88.1
46
With synthetic data we can use Teal as the
ground truth
We generate 500 synthetic datasets with known Knowledge Tracing parameters
Which metrics correlate best with the truth?
47
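A sketch of generating one synthetic response sequence from known Knowledge Tracing parameters, using the standard BKT generative process (initial knowledge, learning, guess, slip); the parameter values and dataset sizes below are illustrative, not the ones used in the paper.

```python
import random

def simulate_kt_sequence(p_init, p_learn, p_guess, p_slip, length, rng=random.Random(0)):
    """One student's responses from the standard Knowledge Tracing generative model."""
    known = rng.random() < p_init
    responses = []
    for _ in range(length):
        if known:
            responses.append(0 if rng.random() < p_slip else 1)   # slip when known
        else:
            responses.append(1 if rng.random() < p_guess else 0)  # guess when unknown
        if not known and rng.random() < p_learn:                  # chance to learn after practice
            known = True
    return responses

# One synthetic dataset: 200 students with 10 opportunities each (illustrative sizes).
dataset = [simulate_kt_sequence(.3, .2, .2, .1, length=10) for _ in range(200)]
```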
48
49
Discussion
50
51
In EDM 2014 we proposed the “FAST” toolkit for Knowledge Tracing with Features
“FAST model improves 25% AUC of ROC”
54
          Teal                                    White
Input     Knowledge Tracing family parameters;   Student’s correct or incorrect responses;
          sequence length                         student model’s predictions of correct/incorrect
Target    Probability of correct                  Probability of correct
55
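Per the table above, Teal takes Knowledge Tracing parameters and a sequence length and, like White, targets the probability of correct. The slides do not reproduce the paper's computation, so the sketch below only estimates comparable score and effort numbers by Monte Carlo, reusing simulate_kt_sequence and white_effort_score from the earlier sketches; the forward-filtered predictions and the mastery threshold are assumptions.

```python
import numpy as np

def teal_estimate(p_init, p_learn, p_guess, p_slip, length, threshold=0.95, n_students=5000):
    """Monte Carlo stand-in for Teal: expected (score, effort) from known KT parameters.
    Simulates students, computes forward-filtered P(correct) as the model's predictions,
    and replays the hypothetical mastery-threshold tutor. Not the paper's computation."""
    scores, efforts = [], []
    for _ in range(n_students):
        actual = simulate_kt_sequence(p_init, p_learn, p_guess, p_slip, length)
        p_known, predicted = p_init, []
        for y in actual:
            p_correct = p_known * (1 - p_slip) + (1 - p_known) * p_guess
            predicted.append(p_correct)
            # Bayesian update on the observed response, then the learning transition
            if y == 1:
                p_known = p_known * (1 - p_slip) / p_correct
            else:
                p_known = p_known * p_slip / (1 - p_correct)
            p_known += (1 - p_known) * p_learn
        effort, score = white_effort_score(actual, predicted, threshold)
        efforts.append(effort)
        scores.append(score)
    return np.mean(scores), np.mean(efforts)

print(teal_estimate(.3, .2, .2, .1, length=10))  # (expected score, expected effort)
```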