Readings
Week 2
- Paper A: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 3
- Paper B: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 4
- Paper C: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 5
- Paper D: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 6
- Paper E: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 7
- Paper F: Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions - Sections 1, 2, and 8
Week 3
- Paper A: Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
- Paper B: Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
- Paper C: Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions
- Paper D: What Makes Training Multi-modal Classification Networks Hard?
Week 5
- Paper A: Improving Multimodal Fusion via Mutual Dependency Maximisation
- Paper B: Deep Multimodal Clustering for Unsupervised Audiovisual Learning
- Paper C: VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
- Paper D: Are Multimodal Transformers Robust to Missing Modality?
Week 7
- Paper A: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- Paper B: Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- Paper C: Grounding Language Models to Images for Multimodal Inputs and Outputs
- Paper D: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
- Paper 1 (optional): NExT-GPT: Any-to-Any Multimodal LLM
- Paper 2 (optional): Grounding ‘Grounding’ in NLP
- Paper 3 (optional): Multimodal Chain-of-Thought Reasoning in Language Models
- Paper 4 (optional): RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- Paper 5 (optional): Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control
Week 11
- Paper A: VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
- Paper B: Make-A-Video: Text-to-Video Generation without Text-Video Data
- Paper C: Simple and Controllable Music Generation
- Paper D: Identifying Implicit Social Biases in Vision-Language Models [Note: the main material continues after the references, starting on page 9]
- Paper 1 (optional): Text-to-image Diffusion Models in Generative AI: A Survey
- Paper 2 (optional): RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
- Paper 3 (optional): MusicLM: Generating Music From Text
- Paper 4 (optional): MultiModal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision Language Models
- Paper 5 (optional): Making the Most of Biased Samples via Contrastive Learning
Week 12
- Paper A: Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks
- Paper B: Provable Dynamic Fusion for Low-Quality Multimodal Data
- Paper C: What Makes Multi-modal Learning Better than Single (Provably)
- Paper D: Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
- Paper 1 (optional): Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
- Paper 2 (optional): Ten myths of multimodal interaction