This document summarizes work that used Bayesian optimization to compress BERT models for question answering, balancing model size against performance. BERT was distilled into smaller student models on SQuAD 2.0 data, with SigOpt tuning both the student architectures and the training hyperparameters. The search found models that exceeded the baseline's performance while reducing model size by more than 20%. The best models had 4-6 layers and maintained over 67% accuracy on SQuAD.
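The document does not reproduce the distillation code, but the core idea behind training a smaller student model, matching the teacher's temperature-softened output distribution, can be sketched as follows (function names and the temperature value are illustrative, not taken from the original work):

```python
import math

def softmax(logits, temperature=1.0):
    # Soften logits by dividing by the temperature before normalizing;
    # higher temperatures spread probability mass across classes.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the teacher's softened distribution ("soft
    # targets") and the student's, scaled by T^2 as in the standard
    # Hinton-style knowledge-distillation formulation.
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2
```

In practice this term is combined with the ordinary cross-entropy loss on the hard SQuAD labels, and quantities such as the temperature and the mixing weight are exactly the kind of training hyperparameters a Bayesian optimizer like SigOpt would tune alongside the student's layer count.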