This document discusses using machine learning to improve the availability of IT services. It notes that scaling resources and ensuring high availability for infrastructure as a service (IaaS) are complex and expensive. The document proposes using machine learning to predict failures and prevent downtime proactively, rather than reacting only after crashes occur. It explains how machine learning can estimate failure probabilities by learning from all available data and using adaptive correlations rather than standard statistics. The document also promotes automating infrastructure, distributing components instead of centralizing them, and using self-optimizing systems to achieve zero downtime.
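As a rough illustration of what estimating a failure probability from monitoring data could look like, the sketch below fits a small logistic regression with plain gradient descent. It is not the method described in the document: the two features (`cpu_load`, `disk_error_rate`), the hand-made samples, the learning rate, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights with batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the linear output
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def failure_probability(weights, bias, x):
    """Return the estimated probability that a host with metrics x fails."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical telemetry, invented for this sketch:
# (cpu_load, disk_error_rate), both scaled to the range 0..1.
SAMPLES = [
    (0.20, 0.00), (0.30, 0.10), (0.40, 0.05), (0.50, 0.10),  # stayed up
    (0.85, 0.60), (0.90, 0.70), (0.95, 0.80), (0.80, 0.90),  # failed
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

WEIGHTS, BIAS = train(SAMPLES, LABELS)
healthy = failure_probability(WEIGHTS, BIAS, (0.25, 0.05))
stressed = failure_probability(WEIGHTS, BIAS, (0.92, 0.75))

if __name__ == "__main__":
    print(f"healthy host:  {healthy:.3f}")
    print(f"stressed host: {stressed:.3f}")
```

In a proactive setup, a host whose estimated probability crosses a chosen threshold could be drained or replaced before it crashes. A production system would presumably use far more signals and a more capable model than this toy example.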