This document gives an overview of the data transformations needed to prepare data for machine learning. It explains how data must be structured as instances described by features for tasks such as classification, regression, and clustering. It also covers common obstacles, such as missing values and incorrect formats, and the techniques used to address them: data cleaning, feature engineering, and feature selection. Specific techniques include joining multiple normalized datasets, handling missing values, aggregating features through counts, and dealing with data from different sources and formats.
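As a rough illustration of three of the techniques named above (joining normalized datasets, handling missing values, and aggregating through counts), here is a minimal Python sketch. The tables, the field names (`customer_id`, `age`, `n_orders`), and the choice of median imputation are assumptions made for this example and are not taken from the document itself.

```python
from statistics import median

# Hypothetical normalized tables: one row per customer, one row per order.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},  # missing value
    {"customer_id": 3, "age": 52},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def build_instances(customers, orders):
    """Return one instance (a dict of features) per customer."""
    # Aggregate the child table: count orders per customer.
    order_counts = {}
    for order in orders:
        cid = order["customer_id"]
        order_counts[cid] = order_counts.get(cid, 0) + 1

    # Impute missing ages with the median of the observed ages
    # (one possible strategy, assumed here for illustration).
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    fill_age = median(known_ages)

    instances = []
    for c in customers:
        instances.append({
            "customer_id": c["customer_id"],
            "age": c["age"] if c["age"] is not None else fill_age,
            # Customers without orders get a count of 0, like a left join.
            "n_orders": order_counts.get(c["customer_id"], 0),
        })
    return instances


instances = build_instances(customers, orders)
```

Each resulting dict is one instance with a fixed set of features, which is the shape most learning algorithms expect. Keeping customers with no orders at a count of 0, rather than dropping them, is a design choice that depends on the task.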