The document describes a proposed method for facial expression recognition in videos using 3D convolutional neural networks and long short-term memory (LSTM) units. Specifically, it proposes a 3D Inception-ResNet architecture that extracts both spatial and temporal features from video sequences. Facial landmarks are incorporated during training, where they are used to generate filters that emphasize important facial components. Finally, an LSTM unit further extracts temporal information from the enhanced feature maps produced by the 3D Inception-ResNet layers. The proposed method is evaluated on several facial expression databases and is shown to outperform state-of-the-art methods.
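The landmark-based emphasis step can be illustrated with a small sketch. The weighting scheme below (a sum of Gaussian bumps centered on landmark coordinates, multiplied element-wise into the feature maps) is an assumption chosen for illustration; the document does not specify the exact form of the filters generated from the landmarks.

```python
import numpy as np

def landmark_weight_map(shape, landmarks, sigma=2.0):
    """Build a spatial weight map that peaks at each facial landmark.

    shape     -- (H, W) of the feature map
    landmarks -- iterable of (row, col) landmark coordinates
    sigma     -- spread of the Gaussian bump around each landmark

    The sum-of-Gaussians form is an illustrative assumption,
    not the paper's specification.
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    weights = np.zeros(shape, dtype=np.float64)
    for r, c in landmarks:
        weights += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    return weights

def emphasize(feature_maps, landmarks, sigma=2.0):
    """Scale each channel of (C, H, W) feature maps by the landmark weight map."""
    wmap = landmark_weight_map(feature_maps.shape[-2:], landmarks, sigma)
    return feature_maps * wmap  # broadcasts over the channel axis

# Toy example: 4 channels of an 8x8 feature map, two hypothetical landmarks.
fmaps = np.ones((4, 8, 8))
out = emphasize(fmaps, landmarks=[(2, 2), (5, 6)])
```

In a full pipeline, `out` would replace the raw 3D Inception-ResNet feature maps before they are passed to the LSTM, so that regions near the eyes, brows, and mouth contribute more strongly to the temporal model.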