Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of AI by designing computer agents that demonstrate intelligent capabilities such as understanding, reasoning, and planning through integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. From the initial research on audio-visual speech recognition to more recent language and vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field poses unique challenges for multimodal researchers, given the heterogeneity of the data and the contingency often found between modalities.

This tutorial builds upon the annual course on multimodal machine learning taught at Carnegie Mellon University and is a completely revised version of the previous tutorials on multimodal learning at the CVPR, ACL, and ICMI conferences. It is based on a revamped taxonomy of the core technical challenges in multimodal machine learning, centered on six challenges: representation, alignment, reasoning, induction, generation, and quantification. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences between approaches and new models. The tutorial is also designed to give a perspective on future research directions in multimodal machine learning.