Educational Data Mining
March 3, 2010
Today’s Class
- EDM
- Assignment #5
- Mega-Survey
Educational Data Mining
“Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.”
www.educationaldatamining.org
Classes of EDM Method (Romero & Ventura, 2007)
- Information Visualization
- Web Mining
- Clustering, Classification, Outlier Detection
- Association Rule Mining / Sequential Pattern Mining
- Text Mining
Classes of EDM Method (Baker & Yacef, 2009)
- Prediction
- Clustering
- Relationship Mining
- Discovery with Models
- Distillation of Data for Human Judgment
Prediction
Develop a model which can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (predictor variables).
- Which students are using CVS?
- Which students will fail the class?
Clustering
Find points that naturally group together, splitting the full data set into a set of clusters. Usually used when nothing is known about the structure of the data.
- What behaviors are prominent in the domain?
- What are the main groups of students?
Relationship Mining
Discover relationships between variables in a data set with many variables.
- Association rule mining
- Correlation mining
- Sequential pattern mining
- Causal data mining
The Beck & Mostow (2008) article is a great example of this.
Discovery with Models
A pre-existing model (developed with EDM prediction methods… or clustering… or knowledge engineering) is applied to data and used as a component in another analysis.
Distillation of Data for Human Judgment
Making complex data understandable by humans, to leverage their judgment. Text replays are a simple example of this.
Focus of today’s class
- Prediction
- Clustering
- Relationship Mining
- Discovery with Models
- Distillation of Data for Human Judgment
There will be a term-long class on this, taught by Joe Beck, in coordination with Carolina Ruiz’s Data Mining class, in a future year. Strongly recommended.
Prediction
Pretty much what it says.
- A student is using a tutor right now. Is he gaming the system or not?
- A student has used the tutor for the last half hour. How likely is it that she knows the knowledge component in the next step?
- A student has completed three years of high school. What will be her score on the SAT-Math exam?
Two Key Types of Prediction
This slide adapted from a slide by Andrew W. Moore, Google: http://www.cs.cmu.edu/~awm/tutorials
Classification
- General idea
- Canonical methods
- Assessment
- Ways to do assessment wrong
Classification
There is something you want to predict (“the label”). The thing you want to predict is categorical: the answer is one of a set of categories, not a number.
- CORRECT/WRONG (sometimes expressed as 0,1)
- HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE
- WILL DROP OUT/WON’T DROP OUT
- WILL SELECT PROBLEM A, B, C, D, E, F, or G
Classification
Associated with each label is a set of “features”, which maybe you can use to predict the label.

Skill            pknow   time   totalactions   right
ENTERINGGIVEN    0.704   9      1              WRONG
ENTERINGGIVEN    0.502   10     2              RIGHT
USEDIFFNUM       0.049   6      1              WRONG
ENTERINGGIVEN    0.967   7      3              RIGHT
REMOVECOEFF      0.792   16     1              WRONG
REMOVECOEFF      0.792   13     2              RIGHT
USEDIFFNUM       0.073   5      2              RIGHT
…
Classification
The basic idea of a classifier is to determine which features, in which combination, can predict the label. (Same data as above: Skill, pknow, time, and totalactions are the features; right is the label.)
Classification
Of course, usually there are more than 4 features, and more than 7 actions/data points. I’ve recently done analyses with 800,000 student actions and 26 features. 5 years ago that would’ve been a lot of data; these days, in the EDM world, it’s just a medium-sized data set.
Classification
One way to classify is with a Decision Tree (like J48).
[Figure: a decision tree that splits first on PKNOW (<0.5 vs. >=0.5), then on TIME (<6 s. vs. >=6 s.) or TOTALACTIONS (<4 vs. >=4), ending in RIGHT/WRONG leaves.]
Classification
Using the same decision tree, classify a new, unlabeled action:

Skill           pknow   time   totalactions   right
COMPUTESLOPE    0.544   9      1              ?

A runnable sketch of this idea follows.
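To make this concrete, here is a minimal sketch of tree-based classification on data shaped like the slides’ example. J48 is WEKA’s C4.5 implementation; scikit-learn’s DecisionTreeClassifier (a CART-style learner) is used here as a stand-in, and the rows are the illustrative values from the slides, not real tutor data.

```python
# A minimal sketch of decision-tree classification on the slides' toy data.
from sklearn.tree import DecisionTreeClassifier

# features: [pknow, time, totalactions]
X = [[0.704, 9, 1], [0.502, 10, 2], [0.049, 6, 1], [0.967, 7, 3],
     [0.792, 16, 1], [0.792, 13, 2], [0.073, 5, 2]]
y = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Classify the unseen action: COMPUTESLOPE, pknow=0.544, time=9, totalactions=1
print(clf.predict([[0.544, 9, 1]]))
```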
Classification
Another way to classify is with step regression: linear regression (discussed later), with a cut-off. A sketch is below.
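A minimal sketch of the step-regression idea, assuming WRONG/RIGHT is coded 0/1 and using 0.5 as the cut-off (both are illustrative choices, not prescribed by the slides):

```python
# Linear regression with a cut-off, used as a classifier.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.704, 9, 1], [0.502, 10, 2], [0.049, 6, 1], [0.967, 7, 3],
              [0.792, 16, 1], [0.792, 13, 2], [0.073, 5, 2]])
y = np.array([0, 1, 0, 1, 0, 1, 1])  # WRONG=0, RIGHT=1

reg = LinearRegression().fit(X, y)
pred_label = (reg.predict(X) >= 0.5).astype(int)  # the cut-off turns numbers into classes
print(pred_label)
```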
And of course…
There are lots of other classification algorithms you can use... SMO (support vector machine), and more, in your favorite Machine Learning package:
- WEKA
- RapidMiner
- KEEL
Comments? Questions?
How can you tell if a classifier is any good?
How can you tell if a classifier is any good?
What about accuracy?

accuracy = # correct classifications / total number of classifications

9200 actions were classified correctly, out of 10000 actions = 92% accuracy, and we declare victory.
What are some limitations of accuracy?
Biased training set
What if the underlying distribution that you were trying to predict was 9200 correct actions and 800 wrong actions, and your model predicts that every action is correct? Your model will have an accuracy of 92%. Is the model actually any good?
What are some alternate metrics you could use?
What are some alternate metrics you could use?
Kappa:

kappa = (Accuracy - Expected Accuracy) / (1 - Expected Accuracy)
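A sketch of computing kappa from the formula above. “Expected accuracy” is taken here to be the agreement expected by chance from the marginal frequencies of each category, which is the usual Cohen’s kappa convention. The example reuses the biased-training-set numbers from earlier.

```python
# Cohen's kappa: (accuracy - expected accuracy) / (1 - expected accuracy).
from collections import Counter

def kappa(true_labels, predicted_labels):
    n = len(true_labels)
    accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    # chance agreement, from the marginal frequency of each category
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (accuracy - expected) / (1 - expected)

# The biased-training-set example: 9200 correct, 800 wrong, model says all correct.
truth = ["CORRECT"] * 9200 + ["WRONG"] * 800
preds = ["CORRECT"] * 10000
print(kappa(truth, preds))  # 0.0 -- 92% accuracy, but no better than chance
```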
What are some alternate metrics you could use?
A′: the probability that if the model is given an example from each category, it will accurately identify which is which.
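A sketch of A′ computed directly from that definition, by checking every (positive, negative) pair of model scores; ties count as half. Computed this way, A′ coincides with the area under the ROC curve (the Wilcoxon statistic). The score lists below are made up for illustration.

```python
# A': probability that a random positive example is scored above a random negative one.
def a_prime(scores_pos, scores_neg):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative model confidences for RIGHT (positive) and WRONG (negative) actions
print(a_prime([0.9, 0.8, 0.7, 0.6], [0.75, 0.4, 0.3]))  # 0.833...
```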
Comparison
Kappa:
- easier to compute
- works for an unlimited number of categories
- wacky behavior when things are worse than chance
- difficult to compare two kappas in different data sets (K=0.6 is not always better than K=0.5)
Comparison
A′:
- more difficult to compute
- only works for two categories (without complicated extensions)
- meaning is invariant across data sets (A′=0.6 is always better than A′=0.55)
- very easy to interpret statistically
Comments? Questions?
What data set should you generally test on?
A vote… Raise your hands as many times as you like.
What data set should you generally test on?
- The data set you trained your classifier on
- A data set from a different tutor
- Split your data set in half (by students), train on one half, test on the other half
- Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times.
Votes? What are the benefits and drawbacks of each?
The dangerous one (though still sometimes OK)
The data set you trained your classifier on. If you do this, there is serious danger of over-fitting.
The dangerous one (though still sometimes OK)
You have ten thousand data points. You fit a parameter for each data point: “If data point 1, RIGHT. If data point 78, WRONG…” Your accuracy is 100%. Your kappa is 1. Your model will neither work on new data, nor will it tell you anything.
The dangerous one (though still sometimes OK)
The data set you trained your classifier on. When might this one still be OK?
K-fold cross validation (standard)
Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times.
What can you infer from this? Your detector will work with new data from the same students.
K-fold cross validation (student-level)
Split your data set in half (by student), train on one half, test on the other half.
What can you infer from this? Your detector will work with data from new students from the same population (whatever it was). A sketch contrasting the two cross-validation schemes is below.
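A minimal sketch with scikit-learn, assuming each row is one action tagged with a student id (all data here is synthetic). KFold splits by action, so the same student can appear in both train and test; GroupKFold with student ids splits by student, which is the more stringent estimate.

```python
# Action-level vs. student-level cross-validation.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))              # 200 actions, 3 features
y = rng.integers(0, 2, 200)           # RIGHT/WRONG as 1/0
students = rng.integers(0, 20, 200)   # which of 20 students produced each action

clf = DecisionTreeClassifier()

# Standard k-fold: folds split by action -> estimates performance on new data
# from the SAME students.
print(cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)))

# Student-level: folds split by student -> estimates performance on NEW students.
print(cross_val_score(clf, X, y, groups=students, cv=GroupKFold(n_splits=10)))
```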
A data set from a different tutor
The most stringent test. When your model succeeds at this test, you know you have a good/general model. When it fails, it’s sometimes hard to know why.
An interesting alternative
Leave-out-one-tutor cross-validation (cf. Baker, Corbett, & Koedinger, 2006):
- Train on data from 3 or more tutors
- Test on data from a different tutor
- (Repeat for all possible combinations)
Good for giving a picture of how well your model will perform in new lessons.
Comments? Questions?
Regression
Regression
There is something you want to predict (“the label”). The thing you want to predict is numerical.
- Number of hints student requests
- How long student takes to answer
- What will the student’s test score be
Regression
Associated with each label is a set of “features”, which maybe you can use to predict the label.

Skill            pknow   time   totalactions   numhints
ENTERINGGIVEN    0.704   9      1              0
ENTERINGGIVEN    0.502   10     2              0
USEDIFFNUM       0.049   6      1              3
ENTERINGGIVEN    0.967   7      3              0
REMOVECOEFF      0.792   16     1              1
REMOVECOEFF      0.792   13     2              0
USEDIFFNUM       0.073   5      2              0
…
Regression
The basic idea of regression is to determine which features, in which combination, can predict the label’s value. (Same data as above: Skill, pknow, time, and totalactions are the features; numhints is the label.)
Linear Regression
The most classic form of regression is linear regression. Alternatives include Poisson regression, Neural Networks...
Linear Regression
The most classic form of regression is linear regression.

Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions

Skill           pknow   time   totalactions   numhints
COMPUTESLOPE    0.544   9      1              ?
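Applying the slide’s fitted equation to the unseen COMPUTESLOPE row is just arithmetic; a minimal sketch (the coefficients come straight from the slide, the helper name is ours):

```python
# Apply the fitted linear regression to a new action.
def predict_numhints(pknow, time, totalactions):
    return 0.12 * pknow + 0.932 * time - 0.11 * totalactions

print(predict_numhints(pknow=0.544, time=9, totalactions=1))  # about 8.34 hints
```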
Linear Regression
Linear regression only fits linear functions (except when you apply transforms to the input variables, which RapidMiner can do for you…)
Linear Regression
However…
- It is blazing fast
- It is often more accurate than more complex models, particularly once you cross-validate: Data Mining’s “Dirty Little Secret”
- It is feasible to understand your model (with the caveat that the second feature in your model is in the context of the first feature, and so on)
Example of Caveat
Let’s study a classic example: drinking too much prune nog at a party, and having an emergency trip to the Little Researcher’s Room.
Data
Data
Some people are resistant to the deleterious effects of prunes and can safely enjoy high quantities of prune nog!
Learned Function
Probability of “emergency” = 0.25 * (Drinks of nog last 3 hours) - 0.018 * (Drinks of nog last 3 hours)^2
But does that actually mean that (Drinks of nog last 3 hours)^2 is associated with fewer “emergencies”? No!
Example of Caveat
(Drinks of nog last 3 hours)^2 is actually positively correlated with emergencies! (r = 0.59)
Example of Caveat
The relationship is only in the negative direction when (Drinks of nog last 3 hours) is already in the model…
Example of Caveat
So be careful when interpreting linear regression models (or almost any other type of model). The sketch below reproduces this sign-flip on synthetic data.
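A synthetic sketch of the caveat: because drinks and drinks² are strongly correlated, drinks² can receive a negative coefficient in the joint model even though its raw correlation with the outcome is positive. The data below is made up to mimic the slides’ learned function.

```python
# Coefficient signs vs. raw correlations with correlated predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
drinks = rng.uniform(0, 10, 500)
# True relationship saturates at high doses, like the slides' learned function
p_emergency = 0.25 * drinks - 0.018 * drinks**2 + rng.normal(0, 0.2, 500)

X = np.column_stack([drinks, drinks**2])
model = LinearRegression().fit(X, p_emergency)
print(model.coef_)                                # second coefficient is negative...
print(np.corrcoef(drinks**2, p_emergency)[0, 1])  # ...but the raw correlation is positive
```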
Comments? Questions?
Discovery with Models
Why do Discovery with Models?
Let’s say you have a model of some construct of interest or importance:
- Knowledge
- Meta-Cognition
- Motivation
- Affect
- Inquiry Skill
- Collaborative Behavior
- Etc.
Why do Discovery with Models?
You can use that model to:
- Find outliers of interest, by finding out where the model makes extreme predictions
- Inspect the model, to learn what factors are involved in predicting the construct
- Find out the construct’s relationship to other constructs of interest, by studying its correlations/associations/causal relationships with data/models on the other constructs
- Study the construct across contexts or students, by applying the model within data from those contexts or students
- And more…
Most frequently
Done using prediction models, though other types of models (in particular knowledge engineering models) are amenable to this as well!
Boosting
Boosting
Let’s say that you have 300 labeled actions, randomly sampled from 600,000 overall actions. Not a terribly unusual case, in these days of massive data sets, like those in the PSLC DataShop.
- You can train the model on the 300, cross-validate it, and then apply it to all 600,000
- And then analyze the model across all actions
- Makes it possible to study larger-scale problems than a human could do without computer assistance
- Especially nice if you have some unlabeled data set with nice properties, for example additional data such as questionnaire data (cf. Baker, Walonoski, Heffernan, Roll, Corbett, & Koedinger, 2008)
A sketch of this workflow is below.
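A sketch of the train-small, apply-large pattern (note that this use of “boosting” is distinct from the ensemble-method sense of the word). Everything below is synthetic stand-in data; the point is the workflow: cross-validate on the labeled 300, then score all 600,000.

```python
# Train on a small labeled sample, validate, then apply to the full log.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_all = np.random.default_rng(2).random((600_000, 26))   # all logged actions
labeled_idx = np.random.default_rng(3).choice(600_000, 300, replace=False)
X_labeled = X_all[labeled_idx]
y_labeled = (X_labeled[:, 0] > 0.5).astype(int)          # hand labels (synthetic here)

clf = LogisticRegression()
print(cross_val_score(clf, X_labeled, y_labeled, cv=10).mean())  # validate first...

clf.fit(X_labeled, y_labeled)
predictions_all = clf.predict(X_all)                      # ...then apply to everything
```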
However…
To do this and trust the result, you should validate that the model can transfer across students, populations, and to the learning software you’re using, as discussed earlier.
A few examples…
Middle School Gaming Detector
Skills from the Algebra Tutor
[Figure: for each skill, the probability of learning the skill at each opportunity, and the initial probability of knowing the skill.]
Which skills could probably be removed from the tutor?
Which skills could use better instruction?
Comments? Questions?
A lengthier example (if there’s time)
Applying Baker et al.’s (2008) gaming detector across contexts
Research Question
Do students game the system because of state or trait factors?
- If trait factors are the main explanation, differences between students will explain much of the variance in gaming
- If state factors are the main explanation, differences between lessons could account for many (but not all) state factors, and explain much of the variance in gaming
So: is the student or the lesson a better predictor of gaming?
Application of Detector
After validating its transfer, we applied the gaming detector across 35 lessons, used by 240 students, from a single Cognitive Tutor. This gives us, for each student in each lesson, a gaming frequency.
Model
Linear Regression models:

Gaming frequency = Lesson + a0
Gaming frequency = Student + a0
Model
Categorical variables transformed to a set of binaries, i.e. Lesson = Scatterplot becomes:
3DGeometry = 0
Percents = 0
Probability = 0
Scatterplot = 1
Boxplot = 0
Etc…
A sketch of this transformation is below.
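A minimal sketch of the transformation with pandas; the lesson names follow the slide.

```python
# One-hot encode the categorical Lesson variable into binary indicator columns.
import pandas as pd

df = pd.DataFrame({"Lesson": ["Scatterplot", "3DGeometry", "Percents", "Boxplot"]})
print(pd.get_dummies(df, columns=["Lesson"]).astype(int))
```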
Metrics
r²
The correlation, squared: the proportion of variability in the data set that is accounted for by a statistical model.
r²
However, a limitation: the more variables you have, the more variance you should be expected to predict, just by chance.
r²
We should expect 240 students to predict gaming better than 35 lessons, just by overfitting.
So what can we do?
BIC
Bayesian Information Criterion (Raftery, 1995). Makes a trade-off between goodness of fit and flexibility of fit (number of parameters). A sketch of the computation is below.
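One common form of Raftery’s (1995) BIC′ for regression compares the fitted model against the null model: BIC′ = n·ln(1 − r²) + k·ln(n), with negative values meaning better than chance at that model size. A sketch, using an illustrative n (the slides do not state the number of observations):

```python
# Raftery's BIC' for a regression model vs. the null model.
import math

def bic_prime(r_squared, n_observations, n_parameters):
    return (n_observations * math.log(1 - r_squared)
            + n_parameters * math.log(n_observations))

# Shape of the comparison in the slides that follow (n here is illustrative):
print(bic_prime(0.55, n_observations=3000, n_parameters=35))   # strongly negative
print(bic_prime(0.16, n_observations=3000, n_parameters=240))  # positive: worse than chance
```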
Predictors
The Lesson
Gaming frequency = Lesson + a0
35 parameters
r² = 0.55
BIC′ = -2370
The model is significantly better than chance would predict, given model size & data set size.
The Student
Gaming frequency = Student + a0
240 parameters
r² = 0.16
BIC′ = 1382
The model is worse than chance would predict, given model size & data set size!
[Figure: gaming frequency results; standard deviation bars, not standard error bars.]
Comments? Questions?
EDM – where?
- Holistic
- Existentialist
- Essentialist
- Entitative
Today’s Class
- EDM
- Assignment #5
- Mega-Survey
Any questions?
Today’s Class
- EDM
- Assignment #5
- Mega-Survey
Mega-Survey
I need a volunteer to bring these surveys to Jim Doyle after class. *NOT THE REGISTRAR*
Mega-Survey Additional Questions (see back)
#1: In future years, should this class be given…
  1: In half a semester, as part of a unified semester class, along with Professor Skorinko’s Research Methods class
  3: Unsure/neutral
  5: As a full-semester class, with Professor Skorinko’s class as a prerequisite
#2: Are there any topics you think should be dropped from this class? [write your answer in the space to the right]
#3: Are there any topics you think should be added to this class? [write your answer in the space to the right]
