This document describes a method for emotional analysis of videos based on multi-modal representations and dense trajectories. The method learns mid-level audio, static visual, and motion representations, which are fused to classify video segments into emotional categories. It is evaluated on music video clips from the DEAP dataset and is shown to outperform both approaches that rely only on low-level features and previously published methods.
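To illustrate the fusion step described above, the following is a minimal sketch of combining per-segment mid-level features from the three modalities into a single representation and training a classifier on it. The feature dimensions, the random placeholder features and labels, and the choice of a linear SVM are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of multi-modal feature fusion for emotion classification,
# assuming mid-level features per video segment have already been extracted
# for each modality (audio, static visual, motion/dense trajectories).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_segments = 200

# Placeholder mid-level representations per segment (random here; in practice
# these would come from the learned audio, visual, and motion models).
audio_feats = rng.standard_normal((n_segments, 64))
visual_feats = rng.standard_normal((n_segments, 128))
motion_feats = rng.standard_normal((n_segments, 96))

# Hypothetical binary emotion labels (e.g., high vs. low arousal).
labels = rng.integers(0, 2, size=n_segments)

# Feature-level fusion: concatenate the three modality representations.
fused = np.concatenate([audio_feats, visual_feats, motion_feats], axis=1)

# Train a linear SVM on the fused representation.
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```

In this sketch the modalities are fused by simple concatenation before classification; other fusion schemes (e.g., combining per-modality classifier scores) follow the same overall structure.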