A knowledge-sharing seminar on multimodal models in computer vision and NLP

Jana Košecká, Associate Professor of Computer Science at George Mason University, Virginia, gave a lecture titled "From Pixels to Words: Frontiers of Multi-Modal Vision-Language Learning".

Lecture abstract

In recent years, foundational models have reshaped the field of computer vision, achieving remarkable advances across tasks such as classification, detection, segmentation, and image generation. This talk surveys the current state of foundational vision models, highlighting advances in self-supervised learning, masked image modelling, contrastive learning, and vision transformers. These models are pre-trained on massive, diverse datasets and then adapted to a wide range of downstream applications. Beyond vision alone, we are witnessing a rapid convergence toward multi-modal vision-language models that align visual and textual information in shared embedding spaces. Models such as Segment Anything, CLIP, BLIP, Flamingo, and LLaVA bridge the gap between visual understanding and natural language, enabling powerful new capabilities such as open-vocabulary recognition, image captioning, visual question answering, and interactive grounding. The talk explores the underlying principles that enable these models to generalise and highlights how vision-language integration is transforming both research and real-world applications. I will also discuss emerging challenges and future directions at the intersection of perception, language, and reasoning.
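To make the abstract's idea of aligning images and text in a shared embedding space more concrete, below is a minimal sketch of CLIP-style contrastive training in PyTorch. The function name, embedding dimension, and random inputs are illustrative assumptions, not material from the lecture; a real system would produce the embeddings with an image encoder and a text encoder.

```python
# A minimal sketch of CLIP-style contrastive alignment (symmetric InfoNCE).
# All names and shapes here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Loss over a batch of paired image/text embeddings (row i of each
    tensor is assumed to come from the same image-caption pair)."""
    # L2-normalise so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_alignment_loss(torch.randn(batch, dim),
                                  torch.randn(batch, dim))
```

Minimising this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is what makes open-vocabulary recognition possible: at test time, class names are embedded as text and compared to the image embedding directly.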

Photos from the lecture