11-877 AMML | Schedule

** Exact topics and schedule subject to change, based on student interests and course discussions. **

Date	Topics	Readings
1/21	Week 1: Course introduction [slides] [synopsis] Course syllabus and requirements Dimensions of multimodal heterogenity	Multimodal Machine Learning: A Survey and Taxonomy Representation Learning: A Review and New Perspectives
1/28	Week 2: Cross-modal interactions [synopsis] What are the different ways in which modalities can interact with each other in multimodal tasks? Can we formalize a taxonomy of such cross-modal interactions, which will enable us to compare and contrast them more precisely? What are the design decisions (aka inductive biases) that can be used when modeling these cross-modal interactions in machine learning models? What are the advantages and drawbacks of designing models to capture each type of cross-modal interaction? Consider not just prediction performance, but tradeoffs in time/space complexity, interpretability, etc. Given an arbitrary dataset and prediction task, how can we systematically decide what type of cross-modal interactions exist, and how can that inform our modeling decisions? Given trained multimodal models, how can we understand or visualize the nature of cross-modal interactions?	Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! What Does BERT with Vision Look At? Multiplicative Interactions and Where to Find Them Cooperative Learning for Multi-view Analysis Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
2/4	Week 3: Multimodal co-learning [synopsis] What are the types of cross-modal interactions involved to enable such co-learning scenarios where multimodal training ends up generalizing to unimodal testing? What are some design decisions (inductive bias) that could be made to promote transfer of information from one modality to another? How do we ensure that during co-learning, only useful information is transferred, and not some undesirable bias? This may become a bigger issue in low-resource settings. How can we know if co-learning has succeeded? Or failed? What approaches could we develop to visualize and probe the success of co-learning? How can we formally, empirically, or intuitively measure the additional information provided by auxiliary modality? How can we design controlled experiments to test these hypotheses? What are the advantages and drawbacks of information transfer during co-learning? Consider not just prediction performance, but also tradeoffs with complexity, interpretability, fairness, etc.	Multimodal Prototypical Networks for Few-shot Learning SMIL: Multimodal Learning with Severely Missing Modality Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision What Makes Multi-modal Learning Better than Single (Provably) Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities Zero-Shot Learning Through Cross-Modal Transfer 12-in-1: Multi-Task Vision and Language Representation Learning A Survey of Reinforcement Learning Informed by Natural Language
2/11	Week 4: Pretraining paradigm [synopsis] Is large-scale pretraining the way forward for building general AI models? What information potentially cannot be captured by pretraining? What are the risks of pretraining? What are the types of cross-modal interactions that are likely to be modeled by current pretrained models? What are the cross-modal interactions that will be harder to model with these large-scale pretraining methods? How can we best integrate multimodality into pretrained language models? What kind of additional data and modeling/optimization decisions do we need? What are the different design decisions when integrating multimodal information in pretraining models and objectives? What are the main advantages and drawbacks of these design choices? Consider not just prediction performance, but tradeoffs in time/space complexity, interpretability, and so on. How can we evaluate the type of multimodal information learned in pretrained models? One approach is to look at downstream tasks, but what are other ways to uncover the knowledge stored in pretrained models?	Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers Multimodal Few-Shot Learning with Frozen Language Models Unifying Vision-and-Language Tasks via Text Generation FLAVA: A Foundational Language And Vision Alignment Model Pretrained Transformers as Universal Computation Engines On the Opportunities and Risks of Foundation Models Does Vision-and-Language Pretraining Improve Lexical Grounding? Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models Integrating Multimodal Information in Large Pretrained Transformers Zero-Shot Text-to-Image Generation
2/18	Week 5: Multimodal reasoning [synopsis] What are the various reasoning processes required in multimodal tasks, where data comes from heterogeneous sources? What could be a taxonomy of the main processes involved in multimodal reasoning? Are there unique technical challenges that arise because reasoning is performed on multimodal data? What are these unique challenges? How can we start studying these challenges in future research? How should we model cross-modal interactions when performing reasoning over multimodal data? Grounding words with visual objects could be an example of a reasoning step required with multimodal data. Other reasoning involved in modeling the different types of cross-modal interactions (e.g., additive, multiplicative)? What are the main advantages of reasoning-based approaches, when compared to the large-scale pre-training methods discussed last week? What are the potential issues with reasoning? Can we perform reasoning on very large datasets? Can pre-training methods eventually learn reasoning processes similar to humans? Or will we still need human and domain knowledge to some extent? Can you imagine a way to uncover the reasoning capabilities of black-box model, such as a large-scale pre-trained model? How can one discover specifically the cross-modal reasoning processes in such a black-box model? To what extent do we need external knowledge when performing reasoning, specifically multimodal reasoning? What type of external knowledge is likely to be needed to succeed in multimodal reasoning?	CLEVRER: CoLlision Events for Video REpresentation and Reasoning Neuro-Symbolic Visual Reasoning: Disentangling “Visual” from “Reasoning” Learning to Compose and Reason with Language Tree Structures for Visual Grounding Heterogeneous Graph Learning for Visual Commonsense Reasoning Multimodal Logical Inference System for Visual-Textual Entailment A Closer Look at the Robustness of Vision-and-Language Pre-trained Models CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning VQA-LOL: Visual Question Answering under the Lens of Logic Deep Compositional Question Answering with Neural Module Networks Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries KAT: A Knowledge Augmented Transformer for Vision-and-Language
2/25	Week 6: Memory and long-term interactions [synopsis] What are the scenarios in which memory for long-term interactions is required in multimodal tasks, where data comes from heterogeneous sources? What could be a taxonomy of long-range cross-modal interactions that may need to be stored in memory? What are certain methods of parametrizing memory in unimodal models that may be applied for multimodal settings, and the various strengths/weaknesses of each approach? How should we model long-term cross-modal interactions? How can we design models (perhaps with memory mechanisms) to ensure that these long-term cross-modal interactions are captured? What are the main advantages of explicitly building memory-based modules into our architectures, as compared to the large-scale pre-training methods/Transformer models discussed in week 4? Do Transformer models already capture memory and long-term interactions implicitly? To what extent do we need external knowledge when performing reasoning, specifically multimodal reasoning? What type of external knowledge is likely to be needed to succeed in multimodal reasoning? A related topic is multimodal summarization: how to summarize the main events from a long multimodal sequence. How can we summarize long sequences while keeping cross-modal interactions? What is unique about multimodal summarization?	Long Range Arena: A Benchmark for Efficient Transformers Large Memory Layers with Product Keys Dynamic Memory Networks for Visual and Textual Question Answering Multimodal Memory Modelling for Video Captioning Episodic Memory in Lifelong Language Learning ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection Hybrid computing using a neural network with dynamic external memory History Aware Multimodal Transformer for Vision-and-Language Navigation Do Transformers Need Deep Long-Range Memory? Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context Neural Turing Machines Meta-Learning with Memory-Augmented Neural Networks
3/4	Week 7: No classes – Spring break None!	None!
3/11	Week 8: No classes – Spring break None!	None!
3/18	Week 9: Brain and multimodal perception [synopsis] What are the main takeaways from neuroscience regarding unimodal and multimodal processing, integration, alignment, translation, and co-learning? How can these insights inform our design of multimodal models, following the topics we covered previously (cross-modal interactions, co-learning, pre-training, reasoning)? To what extent should we design AI models with the explicit goal to mirror human perception and reasoning, versus relying on large-scale pre-training methods and general neural network models? What different paradigms for multimodal perception and learning could be better aligned with how the brain processes multiple heterogeneous modalities? How does the human brain represent different modalities (visual, acoustic)? Are these different modalities represented in very heterogeneous ways? How is information linked between modalities? What are several challenges and opportunities in multimodal learning from high-resolution signals such as fMRI and MEG/EEG? What are some ways in which multimodal learning can help in the future analysis of data collected in neuroscience?	Multimodal Images in the Brain Multimodal Mental Imagery Crossmodal Processing in the Human Brain: Insights from Functional Neuroimaging Studies Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons A Theoretical Computer Science Perspective on Consciousness Inducing brain-relevant bias in natural language processing models The Brain-IHM Dataset: a New Resource for Studying the Brain Basis of Human-Human and Human-Machine Conversations Multi-Modal Perception Decoding Brain Representations by Multimodal Learning of Neural Activity and Visual Features Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition BRAINZOOM: High Resolution Reconstruction from Multi-modal Brain Signals
3/25	Week 10: Beyond language and vision [synopsis] What are the modalities beyond language and vision that are important for real-world applications? What unique structure do they contain, and what are the main challenges in performing multimodal learning with them? When reflecting on the heterogeneous aspect of multimodal learning, how are the other modalities different from language, speech, and vision? What dimensions of heterogeneity are important for these other modalities? What are the cross-modal interactions that you expect in these other modalities? Could you see ways to model cross-modal interactions with these other modalities and with language and vision? How do the core research problems of unimodal and multimodal processing, integration, alignment, translation, and co-learning generalize to modalities beyond language and vision? What core insights from these ‘common’ modalities have yet to be explored in understudied modalities? What is the best way to visualize these relatively understudied modalities? How can we best analyze and characterize the multimodal interactions present between these other modalities? How to learn models for many modalities (10+ modalities)? What are the chances to create multimodal learning algorithms that work for all modalities? What are the tradeoffs between modality-specific multimodal models and general-purpose multimodal models? If two modalities are very far from each other (strong heterogeneity and/or encoding very different information), how can we address the problem of multimodal learning?	A comprehensive survey on multimodal medical signals fusion for smart healthcare systems Multimodal Sensor Fusion with Differentiable Filters Integration of EEG/MEG with MRI and fMRI A Multi-Sensor Fusion System for Moving Object Detection and Tracking in Urban Driving Environments Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks Multi-sensor fusion in body sensor networks: State-of-the-art and research challenges Multi-Sensor Fusion: A Perspective MultiBench: Multiscale Benchmarks for Multimodal Representation Learning HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning Sensor Fusion for Mobile Robot Navigation Multi-source information fusion based on rough set theory: A review Combining EEG and fMRI: A Multimodal Tool for Epilepsy Research Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights
4/1	Week 11: Dataset and model biases [synopsis] What could be a taxonomy of biases in multimodal datasets and models? What are some risks related to biases (e.g., social biases) when creating new datasets? How are these risks potentially amplified or reduced when the dataset is multimodal, with heterogeneous modalities? Are there any biases that are specific to multimodal data? What are the imperfections that may arise during human annotations? How do these imperfections in data and labels affect multimodal learning of multimodal representations, cross-modal interactions, co-learning, and pre-training? Can biases also emerge not only from the multimodal training data, but also from the modeling design decisions themselves? What aspects of multimodal modeling are most prone to learning and possibly emphasizing biases? What are potential solutions for tackling these risks and biases in multimodal datasets and models? How can we properly identify, visualize and eventually reduce these biases in multimodal datasets and models? How can we better interpret multimodal datasets and models to check for potential biases? What specific dimensions should we strive to understand? What are the tradeoffs between large-scale, noisily-collected and annotated multimodal datasets versus small-scale, carefully-curated and annotated datasets? How do these affect multimodal modeling? How does it relate to the popular pre-training paradigm?	Shortcut learning in deep neural networks Measuring Social Biases in Grounded Vision and Language Embeddings Multimodal datasets: misogyny, pornography, and malignant stereotypes A Case Study of the Shortcut Effects in Visual Commonsense Reasoning Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers Revisiting Visual Question Answering Baselines Analyzing the Behavior of Visual Question Answering Models Adversarial Filters of Dataset Biases Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Annotation Artifacts in Natural Language Inference Data AI and the Everything in the Whole Wide World Benchmark Perceptual Score: What Data Modalities Does Your Model Perceive? Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers Challenges in real-life emotion annotation and machine learning based detection
4/8	Week 12: No classes – CMU Carnival None!	None!
4/15	Week 13: Explainability and interpretability [synopsis] What is a taxonomy of multimodal phenomena we should aim to interpret? In a perfect world, what multimodal information would you expect to be available when interpreting a multimodal model? What multimodal phenomena and characteristics would you want from this “perfect” interpretable model? What aspects of multimodal interpretability extend beyond the unimodal case? What are the dependencies between unimodal and multimodal interpretability? In other words, what needs to be solved on the unimodal side so that we are successful in multimodal interpretability? What approaches and techniques can you imagine being best suited for multimodal interpretation? How should we visualize the results of these multimodal interpretations? Black-box model interpretation vs interpretation by design (white-box)? How can we evaluate that a specific multimodal phenomena (e.g., bimodal interactions) was properly interpreted? How do we measure success in multimodal interpretability? Separate from model interpretation, there is also the topic of dataset interpretation: characterizing and interpreting the multimodal phenomena present in the data itself, independent of a specific model or prediction task. How can we perform multimodal data interpretation, and are there any differences with multimodal model interpretation? What is the best way to visualize relatively understudied modalities beyond language and vision? How can we best analyze and characterize the multimodal interactions present between these other modalities? What are the unique challenges to multimodal explainability, where not only the model is multimodal but also the explanation is potentially multimodal?	M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers Multimodal Explanations by Predicting Counterfactuality in Videos "Why Should I Trust You?": Explaining the Predictions of Any Classifier The Mythos of Model Interpretability Interpretable Machine Learning: Moving From Mythos to Diagnostics The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective Do explanations make VQA models more predictable to a human? Multimodal Neurons in Artificial Neural Networks DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis How does this interaction affect me? Interpretable attribution for feature interactions Leveraging Sparse Linear Layers for Debuggable Deep Networks Rethinking Explainability as a Dialogue: A Practitioner's Perspective ExplainaBoard: An Explainable Leaderboard for NLP
4/22	Week 14: Multimodal generation and ethical concerns [synopsis] What are some challenges in multimodal generation beyond generating each modality individually? How can we synchronize generation across multiple modalities? What degree of multimodal modeling is required for these cross-modal generation to be possible? For example, how much do models need to learn regarding cross-modal interactions, alignment, reasoning, etc? What are the qualities we should consider when evaluating outputs from multimodal generation? What do you think is the best practice to evaluate these qualities? Can we efficiently evaluate these qualities, at scale? What are the opportunities and challenges of automatic and human evaluation? How can we combine the best of both worlds? What are the real-world ethical issues regarding generation? How are these risks potentially amplified or reduced when the dataset is multimodal, with heterogeneous modalities? Are there any ethical issues that are specific to multimodal generation? How can we build a taxonomy of the main ethical concerns related to multimodal generation? How can we update our best practices to help address these ethical concerns? Who is better placed to start this dialogue? How can we make significant changes in this direction of reducing ethical issues?	VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning DALL·E: Creating Images from Text and DALL·E 2 On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? The Social Impact of Deepfakes What a machine learning tool that turns Obama white can (and can’t) tell us about AI bias What comprises a good talking-head video generation?: A Survey and Benchmark Defending Against Neural Fake News Lessons from the PULSE Model and Discussion Text-to-Image Generation Grounded by Fine-Grained User Attention Training for Diversity in Image Paragraph Captioning Multimodal Abstractive Summarization for How2 Videos Extracting Training Data from Large Language Models What Makes Tom Hanks Look Like Tom Hanks Video Generation From Text
4/29	Week 15: Generalization, low-resource, and robustness [synopsis] One general claim is that pre-trained models can help with low-resource settings (e.g., few-shot fine-tuning). What are the multimodal problems where the paradigm of pre-training and fine-tuning may not generalize? What are the technical challenges? What are new research paradigms that should be explored to address the challenges of multimodal low-resource problems? Can you propose a taxonomy of the challenges that should be addressed to make progress in this direction, for low-resource modalities? How can we develop new models that generalize across many modalities (beyond 2 or 3)? What are the tradeoffs between modality-specific multimodal models and general-purpose multimodal models? What are the commonalities and underlying principles shared across diverse modalities and tasks that can enable good generalization? In other words, what are the pre-requirement for generalization to succeed? What are the limits of generalization? In other words, in which cases is generalization across modalities and tasks not possible due to possibly to data heterogeneity or some other reasons? What are these scenarios where generalization may not be possible? How can we perform generalization in the absence of explicit alignment (e.g., paired data) between modalities? How can we tackle the challenges of learning cross-modal interactions, alignment, reasoning, etc? One other aspect of generalization is with real-world settings where noise is present and modalities may be even missing. How can we robustly handle these noisy situations? How can multimodal help? Can multimodal also make these noisy situations harder?	Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language Robust Contrastive Learning against Noisy Views Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions Unsupervised Image Captioning LiT: Zero-Shot Transfer with Locked-image text Tuning Missing Modalities Imputation via Cascaded Residual Autoencoder Multimodal Learning with Incomplete Modalities by Knowledge Distillation Unsupervised Multimodal Representation Learning across Medical Images and Reports Multimodal Prototypical Networks for Few-shot Learning MultiBench: Multiscale Benchmarks for Multimodal Representation Learning