This paper presents a multidimensional approach to detecting human actions in video content that integrates audio and visual cues to improve recognition accuracy. The proposed framework uses Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to process audiovisual data, with the aim of improving the understanding and retrieval of video segments. Experimental results indicate that the combined audio-visual system significantly outperforms single-modality methods in detecting action scenes.
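
As a rough illustration of how HMM and GMM scores from two modalities might be combined, the following minimal sketch scores an audio stream and a visual stream under per-class HMMs with GMM emissions and fuses the two log-likelihoods before picking a class. The summary above does not state the fusion scheme, so the weighted late fusion used here is an assumption, not the paper's method. All names (`gmm_logpdf`, `hmm_loglik`, `classify`, `audio_weight`, `toy_hmm`), the scalar features, and the toy parameters are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmm_logpdf(x, mixture):
    """Log density of scalar x under a 1-D GMM given as (weight, mean, std) triples."""
    return logsumexp([
        math.log(w) - math.log(s * math.sqrt(2 * math.pi)) - 0.5 * ((x - mu) / s) ** 2
        for w, mu, s in mixture
    ])


def hmm_loglik(obs, start, trans, emissions):
    """Forward algorithm in the log domain: returns log P(obs | HMM).

    Assumes all start and transition probabilities are strictly positive.
    """
    n = len(start)
    alpha = [math.log(start[i]) + gmm_logpdf(obs[0], emissions[i]) for i in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + gmm_logpdf(x, emissions[j])
            for j in range(n)
        ]
    return logsumexp(alpha)


def classify(audio_obs, visual_obs, models, audio_weight=0.5):
    """Assumed late fusion: weighted sum of per-modality log-likelihoods, then argmax."""
    scores = {}
    for label, (audio_hmm, visual_hmm) in models.items():
        scores[label] = (
            audio_weight * hmm_loglik(audio_obs, *audio_hmm)
            + (1 - audio_weight) * hmm_loglik(visual_obs, *visual_hmm)
        )
    return max(scores, key=scores.get), scores


def toy_hmm(low_mean, high_mean):
    """Hypothetical two-state HMM; the second state emits from a two-component GMM."""
    start = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emissions = [
        [(1.0, low_mean, 1.0)],
        [(0.5, high_mean, 1.0), (0.5, high_mean + 1.0, 1.0)],
    ]
    return start, trans, emissions


# Hypothetical models: one (audio HMM, visual HMM) pair per class.
MODELS = {
    "action": (toy_hmm(4.0, 6.0), toy_hmm(5.0, 7.0)),
    "non_action": (toy_hmm(0.0, 1.0), toy_hmm(0.0, 2.0)),
}

label, scores = classify([5.1, 6.2, 5.8], [6.0, 7.1, 6.5], MODELS)
print(label, scores)
```

Setting `audio_weight` to 1.0 or 0.0 reduces the classifier to a single-modality baseline, which is one way the comparison described above could be reproduced on real features.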