This document presents methods for 3D audio reconstruction and speaker recognition using supervised learning on voice and visual cues from a single video stream. The approach detects faces, classifies them with models trained on calibration data, and tracks face positions over time. Speech recognition labels speech frames with the corresponding speaker, and face and speech likelihoods are then combined for speaker recognition. Reconstructed 3D audio is produced by convolving the audio with head-related transfer functions selected according to the speaker's detected position over time. The approach assumes one or two speakers, all present in the training database, and no sudden movements. Accuracy of 95-100% is achieved on sample face classification and tracking tests.
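The 3D reconstruction step, convolving the mono audio with a pair of head-related impulse responses chosen from the speaker's position, can be sketched as follows. This is a minimal illustration, not the document's actual implementation: the HRIRs here are synthetic toy filters (a delayed, attenuated impulse standing in for interaural time and level differences), whereas a real system would look up measured responses for the tracked azimuth from an HRTF database.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right head-related impulse
    responses (HRIRs) to produce a binaural stereo signal."""
    left = np.convolve(mono, hrir_left)    # 'full' convolution
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Toy HRIRs (illustrative only): the near ear gets an undelayed unit
# impulse; the far ear gets a delayed, attenuated impulse, modeling
# interaural time and level differences for a source off to one side.
fs = 16000
mono = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s, 440 Hz tone
hrir_l = np.zeros(64)
hrir_l[0] = 1.0          # near ear: no delay, full level
hrir_r = np.zeros(64)
hrir_r[10] = 0.6         # far ear: 10-sample delay, attenuated

stereo = spatialize(mono, hrir_l, hrir_r)
print(stereo.shape)  # (16063, 2): len(mono) + len(hrir) - 1 samples
```

In a tracking setting, the audio would be processed in short frames, with the HRIR pair re-selected each frame from the speaker's current detected position and the filtered frames overlap-added to follow the motion.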