KOSMOS-1 is a multimodal large language model that can perceive and process natural language together with visual inputs such as images. It was trained on web-scale multimodal corpora consisting of text, image-caption pairs, and documents with interleaved images and text, aligning its visual perception with its language understanding. Experimental results showed that KOSMOS-1 performs well on language-only, vision-language, and combined tasks, including image captioning, visual question answering, and describing images according to text instructions, all without any task-specific fine-tuning. The ability to perceive and understand multiple modalities lets language models acquire knowledge in new ways and broadens their applicability to areas such as robotics and document intelligence.
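One way to picture how such a model consumes mixed inputs is as a single flat token sequence in which each image is wrapped in special boundary tokens and later replaced by embeddings from a vision encoder. The sketch below is purely illustrative, not the official KOSMOS-1 code: the token names and the `build_interleaved_sequence` helper are assumptions for demonstration, and text is naively whitespace-tokenized.

```python
from typing import List, Union

# Hypothetical special tokens marking where an image sits in the stream.
IMG_BEGIN, IMG_END = "<image>", "</image>"


def build_interleaved_sequence(parts: List[Union[str, bytes]]) -> List[str]:
    """Flatten a mixed list of text and image parts into one token stream.

    Text is whitespace-tokenized for simplicity; each image (raw bytes
    here) becomes a placeholder between boundary tokens, which a real
    model would fill with patch embeddings from a vision encoder.
    """
    tokens: List[str] = []
    for part in parts:
        if isinstance(part, bytes):  # stand-in for raw image data
            tokens += [IMG_BEGIN, "<img_emb>", IMG_END]
        else:
            tokens += part.split()
    return tokens


# Example: a visual question answering style prompt interleaving
# text and one (fake) image.
prompt = build_interleaved_sequence(
    ["Question: What is shown in", b"\x89PNG...", "Answer:"]
)
print(prompt)
```

Because language and vision share one sequence, the same decoder that completes text can also condition on images, which is what enables captioning and visual question answering without per-task heads.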