Aspiring Minds | Automata

Aspiring Minds
www.aspiringminds.com
Grading Programs using Machine Learning
Varun Aggarwal
Presented at KDD, 2014

Programming Assessments: Existing solutions
• Manual evaluation: Can’t scale; not standardized
• Test-case based evaluation:
  • High false-positives – hard-coded answers, inadvertent errors
  • High false-negatives – correct code but not efficient
• Similarity metric between control flow graphs, syntax trees:
  • Needs to handle multiple correct implementations, which in theory does not fit this approach
  • No mapping from the metric to objective feedback

Automatic grading of programs – Why?
• Widely performed: will save professors and TAs a lot of time.
• Companies can recruit efficiently.
• MOOCs need automated open-response assessments to be truly effective. True scaling of such systems has not yet been achieved.

Our Approach
Automata – Automatic program evaluation engine
• Machine Learning based scoring engine: A model to predict the logical correctness of a program, given the control and data dependencies it possesses.
• Evaluation of programming best practices: Lint-styled rule-based system to detect programs not following programming best practices.
• Asymptotic complexity evaluation: Measures the run-time of the code for various input sizes and empirically derives the complexity.
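One common way to derive complexity empirically from run-times is to fit a straight line to log(time) against log(input size) and read the slope as the polynomial exponent. The sketch below illustrates that idea only; the slides do not say how Automata does it, and the names fit_exponent, measure, and quadratic_work are hypothetical.

```python
import math
import time


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests linear growth, near 2 quadratic growth.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0:
        raise ValueError("sizes must not all be equal")
    return num / den


def measure(func, sizes, repeats=3):
    """Best-of-`repeats` wall-clock time of func(n) for each n in sizes."""
    timings = []
    for n in sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(n)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def quadratic_work(n):
    """Stand-in for a candidate solution with a nested loop (made up for the demo)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


sizes = [200, 400, 800]
exponent = fit_exponent(sizes, measure(quadratic_work, sizes))
print(f"estimated exponent: {exponent:.2f}")
```

Measured timings are noisy, so the estimate only approximates the true exponent; taking the best of several runs and using well-separated input sizes reduces the noise.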

Why do programming modules give a better test-shortlist rate?
• Programming has more predictive power than Logical ability in identifying good performers.
• Because Logical has lower predictive power, a higher cut-off has to be applied to it than to Programming to get the same organizational efficiency.
• The higher a person's Programming capability, the lower the requirement on the Logical score.
• If a person scores below a given threshold on Programming, even a higher Logical ability does not help.

ML based scoring
[Figure: scoring pipeline with steps numbered 1 to 7. Labels: Understanding the human process; Evaluation Rubric; Problem and Language independent; Features; Graded programs; Ungraded programs; Machine learning model; Predicted grades]

OBJECTIVE
To print the pattern of integers
1
2 3
3 4 5
4 5 6 7

An implementation
#include <stdio.h>

/* Prints N rows; row i starts at i and holds i consecutive integers. */
void print_1(int N) {
    for (int i = 1; i <= N; i++) {
        printf("\n");                 /* start a new row */
        int count = i;
        for (int j = 0; j < i; j++) {
            printf("%d ", count);
            count++;                  /* advance inside the inner loop */
        }
    }
}

What does a grader look for?
1. Are there loops? Are there print statements?
2. Is there a nested-loop structure?
3. Is the conditional in the inner loop dependent on
- a variable modified in the outer loop?
- a variable used in the conditional of the outer loop?

Grammar for expressing features
• Simple features
  • Keywords and Tokens (Counts):
    • Tokens like for, if, return, break; function calls like printf, strrev, strcat; declarations like int, char
    • Operators used: arithmetic, logical, relational
    • Character constants like ‘0’, ‘ ’, ‘65’, ‘96’
• Capturing logical constructs (Interactions)
  • Control flow structure
  • Data-dependencies
  • Data-dependencies in context of control-flow

CONTROL FEATURES – COUNTS
Counts of control-related keywords/tokens
Ex. count(for) = 2
count(for-in-for) = 1
count(while) = 0
Control-context of these keywords
- The Print command as loop(loop(print))

TARGET PROGRAM
void print(int N){
    for(i = 1; i <= N; i++){
        print newline;
        count = i;
        for(j = 0; j < i; j++){
            print count; count++;
        }
    }
}

CONTROL FLOW GRAPH
[Figure: control flow graph of the target program. Nodes: i = 1, i <= N, i++, count = i, j = 0, j < i, print(count), count++, j++, END. Regions: Parent scope, Loop 1, Loop 2]
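The counts above can be reproduced with a small tree walk. This is a sketch under assumptions: the tuple-based tree, the PRINT_PATTERN literal, and control_features are all hypothetical stand-ins, not Automata's actual representation or code.

```python
from collections import Counter

# Hypothetical toy syntax tree for the target program: (kind, children).
PRINT_PATTERN = ("func", [
    ("for", [
        ("print", []),       # print newline
        ("assign", []),      # count = i
        ("for", [
            ("print", []),   # print count
            ("++", []),      # count++
        ]),
    ]),
])

LOOP_KINDS = {"for", "while"}


def control_features(node, loops=(), counts=None):
    """Count loop keywords, loop-in-loop nesting, and the loop context of statements."""
    counts = Counter() if counts is None else counts
    kind, children = node
    if kind in LOOP_KINDS:
        counts[f"count({kind})"] += 1
        if loops:
            # e.g. count(for-in-for): this loop sits directly inside another loop
            counts[f"count({kind}-in-{loops[-1]})"] += 1
        inner = loops + (kind,)
    else:
        if loops and not children:
            # wrap the statement once per enclosing loop, e.g. loop(loop(print))
            label = kind
            for _ in loops:
                label = f"loop({label})"
            counts[label] += 1
        inner = loops
    for child in children:
        control_features(child, inner, counts)
    return counts


features = control_features(PRINT_PATTERN)
print(features["count(for)"], features["count(for-in-for)"], features["count(while)"])
```

A Counter returns 0 for keys it has never seen, which is why count(while) needs no special handling.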

DATA OPERATION FEATURES IN CONTROL-CONTEXT
Counts of data-related tokens in context of the control structure
Ex. count(block1 :loop(loop(++))) = 2
count(block1 :loop(loop_cond(<))) = 1
Capture control-context of data-dependencies in groups of expressions
- j < i : var (i) related to var (j) : appearing in a loop(loop_cond)
- i++ : var (i) previously incremented : appearing in a loop
The relation and the increment happen in the same block

CONTROL FLOW INFORMATION ANNOTATED IN A-D-D GRAPH
[Figure: dependency graph of the target program with each node annotated by its control context (Parent scope, Loop 1, or Loop 2). Nodes: i = 1, i <= N, i++, count = i, count = 0, j = 0, j < i, print(count), count++, j++]
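The two example counts can be reproduced by tagging every operator with the loop nesting it appears in. The sketch below is illustrative only: the Loop class, wrap, and data_features are hypothetical names, the "block1" prefix is dropped, and the variable-to-variable dependencies (var (i) related to var (j)) are not modeled.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Loop:
    cond_op: str                               # relational operator in the loop condition
    update_op: str                             # operator in the loop's update expression
    body: list = field(default_factory=list)   # operator strings or nested Loop objects


def wrap(label, times):
    """Wrap a label in loop(...) once per enclosing loop."""
    for _ in range(times):
        label = f"loop({label})"
    return label


def data_features(loop, depth=0, counts=None):
    """Count operators keyed by their control context."""
    counts = Counter() if counts is None else counts
    # The condition belongs to this loop's header: loop_cond(op), wrapped by outer loops.
    counts[wrap(f"loop_cond({loop.cond_op})", depth)] += 1
    # The update expression executes inside this loop.
    counts[wrap(loop.update_op, depth + 1)] += 1
    for item in loop.body:
        if isinstance(item, Loop):
            data_features(item, depth + 1, counts)
        else:
            counts[wrap(item, depth + 1)] += 1
    return counts


# for(j = 0; j < i; j++) { print count; count++; }
inner = Loop(cond_op="<", update_op="++", body=["++"])
# for(i = 1; i <= N; i++) { count = i; <inner loop> }
outer = Loop(cond_op="<=", update_op="++", body=["=", inner])

features = data_features(outer)
print(features["loop(loop(++))"], features["loop(loop_cond(<))"])
```

Both j++ and count++ land in loop(loop(++)), which gives the count of 2 quoted on the slide.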

Case study
• Deployed Automata in the recruitment process of a major product-based company
• Analyzed the performance improvement from using Automata over a selection criterion based on test-case pass
• 22.6% of candidates who were not being shortlisted through test-case pass were now shortlisted using Automata.

Experimental Results
Sort Problem

Doing it the one-class way!
PROBLEM   All features      Basic features
          Mean    Min25     Mean    Min25
1         0.57    0.61      0.52    0.56
2         0.80    0.83      0.72    0.75
3         0.75    0.81      0.59    0.73
4         0.81    0.81      0.75    0.75
5         0.68    0.69      0.55    0.61
Betters the test-case score in all problems but one

How good is the final ML-based score?
Validation correlation >= 0.79
Matches the inter-rater correlation between two human raters
PROBLEM   # of features   Cross-val correl   Train correl   Validation correl   Test Case Score
1         80              0.61               0.85           0.79                0.54
2         68              0.77               0.93           0.91                0.80
3         193             0.91               0.98           0.90                0.64
4         66              0.90               0.94           0.90                0.80
5         87              0.81               0.92           0.84                0.84
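The correlations in this table compare predicted grades with human grades. Assuming the usual Pearson correlation is meant (the slides do not name the coefficient), it can be computed as in the sketch below. The grade lists are made up for illustration and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grades on the 1-5 rubric, invented for this example.
human_grades = [1, 2, 3, 4, 5, 3]
predicted_grades = [1.4, 2.1, 2.7, 4.2, 4.6, 3.3]

r = pearson(human_grades, predicted_grades)
print(f"correlation with human grades: {r:.2f}")
```

The same function applied to two human raters' grades would give the inter-rater correlation that the model is compared against.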

Can we get insight?
Features for the FindDigit problem analyzed. Given a multi-digit number and a digit, one has to find the number of times the digit appears in the number.
• The most contributing feature for the Find Digit problem -
int findDigit(int N, int digit){
    …
    …
    LOOP (N != <constant value>){
        …
        N = N / <constant value>
        …
    }
}

Yes, we can!
• The most contributing feature for the Find Digit problem -
    …
    LOOP (N != <constant value>){
        …
        N = N / <constant value>
        …
    }
    …
}

    ...
    while(N != 0){
        d = N%10;
        if(d == digit)
            ...
        N = N / 10;
    }
}
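For reference, a complete solution of the kind this feature matches might look like the sketch below. It is an illustration only: the function name find_digit, the counter increment inside the branch, and the handling of negative input are assumptions, since the slide elides those parts.

```python
def find_digit(n, digit):
    """Count how many times `digit` (0-9) appears in the decimal form of n."""
    n = abs(n)               # assumption: the sign is ignored
    count = 0
    while n != 0:            # LOOP (N != <constant value>)
        d = n % 10           # last decimal digit
        if d == digit:
            count += 1       # assumption: the elided branch increments a counter
        n = n // 10          # N = N / <constant value>
    return count


print(find_digit(1213, 1))
```

Like the C fragment on the slide, this loop never runs for n = 0, so find_digit(0, 0) returns 0; a graded solution might need to treat that case separately.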

Evaluation Rubric
Score   Interpretation
5       Completely correct and efficient
        An efficient implementation of the problem using right control structures and data-dependencies.
4       Correct with some silly errors
        Correct control structures and closely matching data-dependencies. Some silly mistakes cause the code to fail test cases.
3       Inconsistent logical structures
        Right control structures exist, with few correct data-dependencies.
2       Emerging basic structures
        Appropriate keywords and tokens present, showing some understanding of the problem.
1       Gibberish code
        Seemingly unrelated to the problem at hand.

Automata – Sample report
[Figure: sample report with callouts: Problem summary; Candidate’s source code; Test case pass/fail information; Asymptotic complexity of the candidate’s solution; Feedback on programming best practices]

Do our fancy features help?
Control and Data dependency features add around 0.15 correlation points above token information.
PROBLEM   Type of feature      # of features   Cross-val correl   Train correl   Validation correl
1         All, w/o test case   35              0.57               0.72           0.56
          Basic                60              0.62               0.87           0.41
2         All, w/o test case   80              0.81               0.99           0.80
          Basic                26              0.59               0.72           0.67
3         All, w/o test case   190             0.87               0.97           0.90
          Basic                26              0.74               0.89           0.74
4         All, w/o test case   134             0.85               0.91           0.82
          Basic                35              0.83               0.88           0.69
5         All, w/o test case   166             0.66               0.81           0.64
          Basic                40              0.61               0.78           0.61
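The per-problem gain can be read directly off the validation column. The short calculation below copies those ten values from the table; the variable names are ours, and nothing else is assumed.

```python
# Validation correlations copied from the table, problems 1 to 5.
all_features = [0.56, 0.80, 0.90, 0.82, 0.64]    # "All, w/o test case" rows
basic_features = [0.41, 0.67, 0.74, 0.69, 0.61]  # "Basic" rows

# Gain from adding control and data dependency features, per problem.
gains = [round(a - b, 2) for a, b in zip(all_features, basic_features)]
mean_gain = round(sum(gains) / len(gains), 2)

print("per-problem gain:", gains)
print("mean gain:", mean_gain)
```

On the validation column the gain ranges from 0.03 (problem 5) to 0.16 (problem 3), and four of the five problems gain between 0.13 and 0.16.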

Conclusion
• We propose the first machine learning based approach to automatically grade programs.
• An innovative feature grammar is proposed which matches the human intuition of grading programs.
• Models built for sample problems show promising results.
• We propose and demonstrate machine learning techniques that reduce the need for human-graded data to build models.
