PHOSA takes a single 2D image as input and reconstructs a 3D scene of humans interacting with objects, without requiring scene- or object-level 3D supervision. It works on images from the COCO dataset and relies on pre-processed 3D meshes for the objects. The pipeline combines several submodels: Detectron for instance segmentation, FrankMocap for human pose estimation, and PoseOptimizer for finding the optimal object poses. The optimization minimizes several loss terms, including occlusion-aware silhouette, human-object interaction, ordinal depth, and collision losses. The model was deployed with Flask on AWS to demonstrate 3D scene generation from a single image.
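
To make one of these loss terms concrete, here is a minimal sketch of an ordinal depth penalty between two instances. It is an illustration under stated assumptions, not PHOSA's actual implementation: the function name `ordinal_depth_loss`, its arguments, and the flat per-pixel lists are hypothetical, and the softplus form log(1 + exp(d)) is one common way to write such a penalty. A real pipeline would compute this on differentiably rendered depth maps rather than Python lists.

```python
import math


def ordinal_depth_loss(depth_a, depth_b, sil_a, sil_b, visible_a):
    """Penalize pixels where the rendered depth order contradicts the 2D mask.

    All arguments are flat, equal-length per-pixel sequences (hypothetical layout):
      depth_a, depth_b : rendered depth of instance A / B at each pixel
      sil_a, sil_b     : True where the rendered silhouette of A / B covers the pixel
      visible_a        : True where the 2D segmentation says A is the visible instance
    """
    loss = 0.0
    for da, db, sa, sb, va in zip(depth_a, depth_b, sil_a, sil_b, visible_a):
        # Only pixels covered by both renders can have a depth-ordering conflict.
        if sa and sb and va and da > db:
            # A should be in front (smaller depth) but is rendered behind B:
            # apply a softplus penalty on the depth difference.
            loss += math.log1p(math.exp(da - db))
    return loss


# Usage: three pixels, both silhouettes cover all of them.
# Pixel 0: mask says A is visible, but A is rendered behind B -> penalized.
# Pixel 1: mask says A is visible and A is in front -> no penalty.
# Pixel 2: mask does not mark A as visible -> ignored by this one-sided sketch.
example_loss = ordinal_depth_loss(
    depth_a=[2.0, 2.0, 1.0],
    depth_b=[1.0, 3.0, 2.0],
    sil_a=[True, True, True],
    sil_b=[True, True, True],
    visible_a=[True, True, False],
)
print(example_loss)
```

The penalty is zero whenever the rendered ordering already agrees with the segmentation, so it only pushes instances apart in depth where the image evidence says the ordering is wrong. In a full objective, a term like this would be added to the silhouette, interaction, and collision terms with per-term weights.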