The document discusses weakly supervised learning from video and images using convolutional neural networks. It describes using scripts as weak supervision for learning actions from movies without explicit labeling. Methods are presented for jointly learning actors and actions from scripts, and for action learning with ordering constraints. The use of CNNs for object and action recognition in images is also summarized, including work on training CNNs using only image-level labels without bounding boxes.