Machine Learning with Spark and Cassandra - Model Selection Tests

Series
Machine Learning with Spark and Cassandra
● Environment Setup
● Data Pre-processing
● Testing
● Cross-Validation
● Model Selection Tests
● Deployment

What are model selection tests?

Overview
● For comparing the relative performance of two different machine learning algorithms
○ Only gives information within a specific domain, based on the data used for tests
● Similar to statistical significance tests used in scientific research
○ Checking whether performance differences are due to model skill or random chance
○ Null hypothesis is that any observed difference is due to random chance
● Requires a specific shared measure of model skill
○ Cannot compare classification vs regression models
○ Cannot compare one model's accuracy to another model's F1 score
● Different tests make different statistical assumptions

Wilcoxon signed-rank test
● A non-parametric alternative to the paired Student's t-test, useful with a small number of samples
● Use k-fold cross-validation to generate k scores for each model
● Feed those two sets of k accuracies into the Wilcoxon significance test
○ Not easily written as a single formula
○ Involves calculating the difference between each pair of scores, ranking the differences by absolute value, attaching the sign of each difference to its rank, and summing the signed ranks
○ The result is a p-value. As in scientific studies, if p < 0.05 then we reject the null hypothesis.
■ p < 0.05 means that, if the null hypothesis were true, a difference at least this large would appear less than 5% of the time. It is not a 95% probability that the difference is real.
● Models must be trained and tested using exactly the same cross-validation folds
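The ranking procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `wilcoxon_signed_rank` and the fold scores are hypothetical, and the p-value is computed exactly by enumerating every sign assignment, which is only practical for the small k used in cross-validation.

```python
from itertools import product


def wilcoxon_signed_rank(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test on paired fold scores.

    Returns (statistic, p_value), where statistic is the smaller of the
    positive and negative rank sums.
    """
    # Paired differences; zero differences carry no sign and are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis every sign pattern is equally likely, so
    # count the patterns that are at least as extreme as the observed one.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Hypothetical accuracies from the same five cross-validation folds.
model_a = [0.90, 0.92, 0.88, 0.91, 0.93]
model_b = [0.85, 0.86, 0.84, 0.83, 0.80]
stat, p = wilcoxon_signed_rank(model_a, model_b)
print(stat, p)
```

Even when one model wins on every fold, five folds allow only 2 of the 32 sign patterns to be this extreme, so the smallest possible two-sided p-value is 0.0625. Rejecting at 0.05 with this exact test therefore needs more than five folds.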

McNemar’s test
● Checks how closely the predictions of two models match, and whether their errors differ significantly
● Build a contingency table
○ Similar to a confusion matrix, but rather than class predictions, its categories are based on whether each model correctly predicted the actual value
○ The table values are used to calculate χ² (chi-squared), which is then used to calculate the p-value
● Works best if b and c (the table cells where only one of the two models is correct) are large
○ Variations exist for situations where b and c are small
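As a rough sketch of the steps above, the following builds the two discordant cells of the contingency table and computes the chi-squared statistic. The function name `mcnemar_test` and the sample predictions are hypothetical, and the continuity correction (subtracting 1 from |b - c|) is one common variant assumed here; the slides do not say which form they use.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b, continuity=True):
    """McNemar's test on two models' predictions for the same test set.

    Returns (b, c, chi2, p_value), where b counts cases only model A got
    right and c counts cases only model B got right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, pa, pb in rows if pa == t and pb != t)
    c = sum(1 for t, pa, pb in rows if pa != t and pb == t)
    if b + c == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return b, c, 0.0, 1.0

    correction = 1.0 if continuity else 0.0
    chi2 = max(abs(b - c) - correction, 0.0) ** 2 / (b + c)

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return b, c, chi2, p_value


# Hypothetical labels: model A is always right, model B misses the first 10.
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 10 + [1] * 10
print(mcnemar_test(y_true, pred_a, pred_b))
```

Only the discordant cells b and c enter the statistic; cases where both models are right or both are wrong say nothing about which one is better.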

5x2CV paired t-test
● Another test on paired scores, like the signed-rank test, built on the paired t-test
● Take a random 50% split of the data, train both models on one half and test on the other for DiffA results, then flip the halves for DiffB results
○ Repeat five times and calculate the mean and variance of the differences for each repetition
○ Calculate the t statistic, then use t to calculate the p-value
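A minimal sketch of the calculation, assuming the commonly cited form of the test in which the numerator is the first difference from the first repetition and the statistic has 5 degrees of freedom. The function names and the score differences are hypothetical; the differences would come from training and scoring both models on the five 50/50 splits.

```python
import math


def two_sided_p_t5(t):
    """Two-sided p-value for a t statistic with 5 degrees of freedom.

    Uses the closed-form expression for the t distribution with odd
    degrees of freedom, so no statistics library is needed.
    """
    theta = math.atan(abs(t) / math.sqrt(5))
    s, c = math.sin(theta), math.cos(theta)
    central_mass = (2 / math.pi) * (theta + s * (c + (2 / 3) * c ** 3))
    return 1 - central_mass


def paired_ttest_5x2cv(diffs):
    """5x2CV paired t-test.

    diffs holds five (diff_a, diff_b) pairs, one per repetition, where each
    value is model A's score minus model B's score on one half of the split.
    Returns (t, p_value).
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly five (diff_a, diff_b) pairs")

    variances = []
    for diff_a, diff_b in diffs:
        mean = (diff_a + diff_b) / 2
        variances.append((diff_a - mean) ** 2 + (diff_b - mean) ** 2)

    denominator = math.sqrt(sum(variances) / 5)
    if denominator == 0:
        raise ValueError("all variances are zero; the statistic is undefined")

    t = diffs[0][0] / denominator
    return t, two_sided_p_t5(t)


# Hypothetical accuracy differences from five repetitions of 2-fold CV.
diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03), (0.02, 0.02), (0.04, 0.02)]
t, p = paired_ttest_5x2cv(diffs)
print(t, p)
```

Because the numerator uses a single difference, the result depends on which split happens to come first, which is part of the motivation for the combined variant.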

5x2CV combined F test
● A variation of the 5x2CV paired t-test
● Rather than treating the results for model A and model B separately, the differences in the performance metric are combined across all ten folds before estimating the mean and variance
● Then calculate the F statistic and use it to calculate the p-value
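The combined statistic can be sketched as follows, assuming the commonly cited form in which the sum of all ten squared differences is divided by twice the sum of the five per-repetition variances, giving an F statistic with 10 and 5 degrees of freedom. The function name and the differences are hypothetical. To stay dependency-free, the sketch compares against the tabulated 5% critical value (about 4.74) instead of computing a p-value; a statistics library's F distribution, for example `scipy.stats.f.sf(f, 10, 5)`, would give the p-value itself.

```python
# Approximate 0.95 quantile of the F distribution with (10, 5) degrees of
# freedom, taken from standard F tables.
F_CRITICAL_05 = 4.74


def combined_ftest_5x2cv(diffs):
    """5x2CV combined F test.

    diffs holds five (diff_a, diff_b) pairs, one per repetition, where each
    value is model A's score minus model B's score on one half of the split.
    Returns (f_statistic, reject_at_05).
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly five (diff_a, diff_b) pairs")

    # Numerator: every one of the ten squared differences contributes.
    numerator = sum(diff_a ** 2 + diff_b ** 2 for diff_a, diff_b in diffs)

    # Denominator: twice the sum of the per-repetition variances.
    variance_sum = 0.0
    for diff_a, diff_b in diffs:
        mean = (diff_a + diff_b) / 2
        variance_sum += (diff_a - mean) ** 2 + (diff_b - mean) ** 2
    if variance_sum == 0:
        raise ValueError("all variances are zero; the statistic is undefined")

    f_statistic = numerator / (2 * variance_sum)
    return f_statistic, f_statistic > F_CRITICAL_05


# Hypothetical accuracy differences from five repetitions of 2-fold CV.
diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03), (0.02, 0.02), (0.04, 0.02)]
print(combined_ftest_5x2cv(diffs))
```

Since all ten differences enter the numerator, reordering the repetitions does not change the result.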

Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM,CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
