Position: Understanding and grounding of multimodal image-language models
Connecting images and language in multimodal image-language models poses many challenges for AI, and previous work shows that current models have poorly understood or unexplored weaknesses, such as relying on language priors. These problems suggest that it is essential to understand how the models work and how knowledge is encoded in them. In this dissertation, we focus on grounding language in vision and on understanding multimodal image-language models. The thesis is supervised by Jana Košecká, professor of computer science at George Mason University, USA.