Simpler Machine Learning with SKLL

Dan Blanchard
Educational Testing Service
dblanchard@ets.org

PyData NYC 2013
Survived
  first class, female, 1 sibling, 35 years old

Perished
  third class, female, 2 siblings, 18 years old
  second class, male, 0 siblings, 50 years old

Can we predict survival from data?
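For concreteness, here is a minimal sketch of those example passengers as label/feature-dictionary pairs, using the Kaggle column names (Pclass, Sex, SibSp, Age); this is only an illustration of the data, not SKLL's required input format:

# The passengers above as (label, feature dict) pairs -- illustrative only
examples = [
    ("Survived", {"Pclass": 1, "Sex": "female", "SibSp": 1, "Age": 35}),
    ("Perished", {"Pclass": 3, "Sex": "female", "SibSp": 2, "Age": 18}),
    ("Perished", {"Pclass": 2, "Sex": "male",   "SibSp": 0, "Age": 50}),
]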
SciKit-Learn Laboratory (SKLL)

It's where the learning happens.
Learning to Predict Survival

1. Split up given training set: train (80%) and dev (20%)

$ ./make_titanic_example_data.py

Creating titanic/train directory
Creating titanic/dev directory
Creating titanic/test directory
Loading train.csv............done
Loading test.csv........done
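Roughly, the 80/20 split amounts to the following; this is only a sketch assuming pandas and the Kaggle train.csv, and the real make_titanic_example_data.py additionally splits the columns into the family/misc/socioeconomic/vitals feature files used below (the output filename here is made up):

import os
import pandas as pd

# Load the Kaggle Titanic training data
df = pd.read_csv('train.csv')

# Shuffle, then use the first 80% for training and the rest for development
df = df.sample(frac=1, random_state=1234)
cutoff = int(0.8 * len(df))
for split_name, part in [('train', df[:cutoff]), ('dev', df[cutoff:])]:
    split_dir = os.path.join('titanic', split_name)
    os.makedirs(split_dir, exist_ok=True)
    part.to_csv(os.path.join(split_dir, 'all_features.csv'), index=False)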
Learning to Predict Survival

2. Pick classifiers to try:
   1. Random forest
   2. Support Vector Machine (SVM)
   3. Naive Bayes
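The learner names used in the SKLL configs below are the underlying scikit-learn estimator classes; a bare scikit-learn sketch of the same three candidates, with default hyperparameters, would look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# The three candidate classifiers SKLL will wrap
candidates = {
    'RandomForestClassifier': RandomForestClassifier(),
    'SVC': SVC(),
    'MultinomialNB': MultinomialNB(),
}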
Learning to Predict Survival

3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
# directory with feature files for training the learner
train_location = train
# directory with feature files for evaluating performance
test_location = dev
# family.csv: number of siblings, spouses, parents, and children
# misc.csv: departure port
# socioeconomic.csv: fare and passenger class
# vitals.csv: sex and age
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
# directory to store evaluation results
results = output
# directory to store trained models
models = output
Learning to Predict Survival
4. Run the configuration file with run_experiment
$ run_experiment evaluate.cfg

Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
Loading dev/misc.csv.....done
Loading dev/socioeconomic.csv.....done
Loading dev/vitals.csv.....done
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
...
Learning to Predict Survival

5. Examine results

Experiment Name: Titanic_Evaluate
Training Set: train
Test Set: dev
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8212290502793296
Aggregate Evaluation Results

Learner                    Dev. Accuracy
RandomForestClassifier     0.821
SVC                        0.771
MultinomialNB              0.709
Tuning learner

• Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
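With grid_search = true, SKLL tunes each learner's hyperparameters with scikit-learn's grid search over a built-in parameter grid, optimizing the chosen objective by cross-validation on the training data. Roughly the equivalent bare scikit-learn call looks like this; the data and the parameter grid below are illustrative stand-ins, not SKLL's actual defaults:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2013-era releases

# Tiny synthetic stand-in for the Titanic feature matrix and labels
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

# Illustrative grid for SVC; SKLL ships its own default grid per learner
param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)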
Tuned Evaluation Results

Learner                    Untuned Accuracy    Tuned Accuracy
RandomForestClassifier     0.821               0.849
SVC                        0.771               0.737
MultinomialNB              0.709               0.709
Using All Available Data

• Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_location = train+dev
test_location = test
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
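Running this works the same way as the evaluation experiment; the config filename below is an assumption:

$ run_experiment predict.cfg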
Test Set Performance

                           Untuned Acc.    Tuned Acc.      Untuned Acc.    Tuned Acc.
Learner                    (Train only)    (Train only)    (Train + Dev)   (Train + Dev)
RandomForestClassifier     0.732           0.746           0.746           0.756
SVC                        0.608           0.617           0.612           0.641
MultinomialNB              0.627           0.623           0.622           0.622
Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
Currently Supported Learners

Classifiers                        Regressors
Linear Support Vector Machine      Elastic Net
Logistic Regression                Lasso
Multinomial Naive Bayes            Linear

Both classifiers and regressors:
Decision Tree, Gradient Boosting, Random Forest, Support Vector Machine
Coming Soon

Classifiers & Regressors:
• AdaBoost
• K-Nearest Neighbors
• Stochastic Gradient Descent
Acknowledgements
• Mike Heilman
• Nitin Madnani
• Aoife Cahill
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and the data-splitting script are in the examples dir on GitHub

Twitter: @Dan_S_Blanchard
GitHub: dan-blanchard
Bonus Slides
Cross-validation

[General]
experiment_name = Titanic_CV
task = cross_validate

[Input]
train_location = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv",
                "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Cross-validation Results

Learner                    Avg. CV Accuracy
RandomForestClassifier     0.815
SVC                        0.717
MultinomialNB              0.681
SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = load_examples('test.tsv')
# evaluate() returns the confusion matrix, accuracy, a dict of
# precision/recall/f-score for each class, the tuned model parameters,
# and the objective function score on the test set
(conf_matrix, accuracy, prf_dict, model_params,
 obj_score) = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
# cross_validate() returns per-fold evaluation results and
# per-fold training set objective scores
(fold_result_list,
 grid_search_scores) = learner.cross_validate(train_examples)
SKLL API
import numpy as np
import os
from skll import write_feature_file

# Create some training examples
num_train_examples = 100  # example count; defined here so the snippet runs standalone
_my_dir = '.'             # base output directory; also an assumption for standalone use
classes = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    ex_id = "{}{}".format(y, i)
    x = {"f1": np.random.randint(1, 4),
         "f2": np.random.randint(1, 4),
         "f3": np.random.randint(1, 4)}
    classes.append(y)
    ids.append(ex_id)
    features.append(x)

# Write them to a file
train_path = os.path.join(_my_dir, 'train',
                          'test_summary.jsonlines')
write_feature_file(train_path, ids, classes, features)
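Tying the two API snippets together, the file written above can be loaded back and used for training just like the earlier examples; a sketch that assumes the previous snippet has already run:

from skll import Learner, load_examples

# Reload the examples written above and train a classifier on them
train_examples = load_examples(train_path)
learner = Learner('RandomForestClassifier')
learner.train(train_examples)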
