Simpler Machine Learning with SKLL

Dan Blanchard
Educational Testing Service
dblanchard@ets.org

PyData NYC 2013
Survived
  first class, female, 1 sibling, 35 years old

Perished
  third class, female, 2 siblings, 18 years old
  second class, male, 0 siblings, 50 years old

Can we predict survival from data?
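For concreteness, here is a minimal sketch of those example passengers as label/feature-dictionary pairs, using the Kaggle column names (Pclass, Sex, SibSp, Age); this is only an illustration of the data, not SKLL's required input format:

# The passengers above as (label, feature dict) pairs -- illustrative only
examples = [
    ("Survived", {"Pclass": 1, "Sex": "female", "SibSp": 1, "Age": 35}),
    ("Perished", {"Pclass": 3, "Sex": "female", "SibSp": 2, "Age": 18}),
    ("Perished", {"Pclass": 2, "Sex": "male",   "SibSp": 0, "Age": 50}),
]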
SciKit-Learn Laboratory (SKLL)

It's where the learning happens.
Learning to Predict Survival

1. Split up given training set: train (80%) and dev (20%)

$ ./make_titanic_example_data.py

Creating titanic/train directory
Creating titanic/dev directory
Creating titanic/test directory
Loading train.csv............done
Loading test.csv........done
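Roughly, the 80/20 split amounts to the following; this is only a sketch assuming pandas and the Kaggle train.csv, and the real make_titanic_example_data.py additionally splits the columns into the family/misc/socioeconomic/vitals feature files used below (the output filename here is made up):

import os
import pandas as pd

# Load the Kaggle Titanic training data
df = pd.read_csv('train.csv')

# Shuffle, then use the first 80% for training and the rest for development
df = df.sample(frac=1, random_state=1234)
cutoff = int(0.8 * len(df))
for split_name, part in [('train', df[:cutoff]), ('dev', df[cutoff:])]:
    split_dir = os.path.join('titanic', split_name)
    os.makedirs(split_dir, exist_ok=True)
    part.to_csv(os.path.join(split_dir, 'all_features.csv'), index=False)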
Learning to Predict Survival

2. Pick classifiers to try:
   1. Random forest
   2. Support Vector Machine (SVM)
   3. Naive Bayes
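The learner names used in the SKLL configs below are the underlying scikit-learn estimator classes; a bare scikit-learn sketch of the same three candidates, with default hyperparameters, would look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# The three candidate classifiers SKLL will wrap
candidates = {
    'RandomForestClassifier': RandomForestClassifier(),
    'SVC': SVC(),
    'MultinomialNB': MultinomialNB(),
}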
Learning to Predict Survival

3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
# directory with feature files for training the learner
train_location = train
# directory with feature files for evaluating performance
test_location = dev
# family.csv: number of siblings, spouses, parents, and children
# misc.csv: departure port
# socioeconomic.csv: fare and passenger class
# vitals.csv: sex and age
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
# directory to store evaluation results
results = output
# directory to store trained models
models = output
Learning to Predict Survival
4. Run the configuration file with run_experiment
$ run_experiment evaluate.cfg

Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
Loading dev/misc.csv.....done
Loading dev/socioeconomic.csv.....done
Loading dev/vitals.csv.....done
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
...
Learning to Predict Survival

5. Examine results

Experiment Name: Titanic_Evaluate
Training Set: train
Test Set: dev
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8212290502793296
Aggregate Evaluation Results

Learner                    Dev. Accuracy
RandomForestClassifier     0.821
SVC                        0.771
MultinomialNB              0.709
Tuning learner

• Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
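With grid_search = true, SKLL tunes each learner's hyperparameters with scikit-learn's grid search over a built-in parameter grid, optimizing the chosen objective by cross-validation on the training data. Roughly the equivalent bare scikit-learn call looks like this; the data and the parameter grid below are illustrative stand-ins, not SKLL's actual defaults:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2013-era releases

# Tiny synthetic stand-in for the Titanic feature matrix and labels
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

# Illustrative grid for SVC; SKLL ships its own default grid per learner
param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)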
Tuned Evaluation Results

Learner                    Untuned Accuracy    Tuned Accuracy
RandomForestClassifier     0.821               0.849
SVC                        0.771               0.737
MultinomialNB              0.709               0.709
Using All Available Data

• Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_location = train+dev
test_location = test
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
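Running this works the same way as the evaluation experiment; the config filename below is an assumption:

$ run_experiment predict.cfg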
Test Set Performance

                           Untuned Acc.    Tuned Acc.      Untuned Acc.    Tuned Acc.
Learner                    (Train only)    (Train only)    (Train + Dev)   (Train + Dev)
RandomForestClassifier     0.732           0.746           0.746           0.756
SVC                        0.608           0.617           0.612           0.641
MultinomialNB              0.627           0.623           0.622           0.622
Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
Currently Supported Learners

Classifiers                        Regressors
Linear Support Vector Machine      Elastic Net
Logistic Regression                Lasso
Multinomial Naive Bayes            Linear

Both classifiers and regressors:
Decision Tree, Gradient Boosting, Random Forest, Support Vector Machine
Coming Soon

Classifiers & Regressors:
• AdaBoost
• K-Nearest Neighbors
• Stochastic Gradient Descent
Acknowledgements
• Mike Heilman
• Nitin Madnani
• Aoife Cahill
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and the data-splitting script are in the examples dir on GitHub

Twitter: @Dan_S_Blanchard
GitHub: dan-blanchard
Bonus Slides
Cross-validation

[General]
experiment_name = Titanic_CV
task = cross_validate

[Input]
train_location = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv",
                "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Cross-validation Results

Learner                    Avg. CV Accuracy
RandomForestClassifier     0.815
SVC                        0.717
MultinomialNB              0.681
SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = load_examples('test.tsv')
# evaluate() returns the confusion matrix, accuracy, a dict of
# precision/recall/f-score for each class, the tuned model parameters,
# and the objective function score on the test set
(conf_matrix, accuracy, prf_dict, model_params,
 obj_score) = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
# cross_validate() returns per-fold evaluation results and
# per-fold training set objective scores
(fold_result_list,
 grid_search_scores) = learner.cross_validate(train_examples)
SKLL API
import numpy as np
import os
from skll import write_feature_file

# Create some training examples
num_train_examples = 100  # example count; defined here so the snippet runs standalone
_my_dir = '.'             # base output directory; also an assumption for standalone use
classes = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    ex_id = "{}{}".format(y, i)
    x = {"f1": np.random.randint(1, 4),
         "f2": np.random.randint(1, 4),
         "f3": np.random.randint(1, 4)}
    classes.append(y)
    ids.append(ex_id)
    features.append(x)

# Write them to a file
train_path = os.path.join(_my_dir, 'train',
                          'test_summary.jsonlines')
write_feature_file(train_path, ids, classes, features)
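Tying the two API snippets together, the file written above can be loaded back and used for training just like the earlier examples; a sketch that assumes the previous snippet has already run:

from skll import Learner, load_examples

# Reload the examples written above and train a classifier on them
train_examples = load_examples(train_path)
learner = Learner('RandomForestClassifier')
learner.train(train_examples)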
