Week 2:

  1. Baltrusaitis et al., Multimodal Machine Learning: A Survey and Taxonomy. TPAMI 2018
  2. Bengio et al., Representation Learning: A Review and New Perspectives. TPAMI 2013

Week 3:

  1. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
  2. Selvaraju et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017
  3. Karpathy et al., Visualizing and Understanding Recurrent Networks. arXiv 2015
  4. Khandelwal et al., Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. ACL 2018

Week 4:

  1. Owens and Efros, Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. ECCV 2018
  2. Wang et al., Learning Deep Structure-Preserving Image-Text Embeddings. CVPR 2016
  3. Eisenschtat and Wolf, Linking Image and Text with 2-Way Nets. CVPR 2017
  4. Zhang et al., AE2-Nets: Autoencoder in Autoencoder Networks. CVPR 2019

Week 5:

  1. Anderson et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
  2. Wiegreffe and Pinter, Attention is not not Explanation. EMNLP 2019
  3. Le et al., Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems. ACL 2019
  4. Tan and Bansal, LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP 2019

Week 7:

  1. Mao et al., The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. ICLR 2019
  2. Kottur et al., Visual Coreference Resolution in Visual Dialog using Neural Module Networks. ECCV 2018
  3. Cuturi and Blondel, Soft-DTW: a Differentiable Loss Function for Time-Series. ICML 2017
  4. Zhu et al., Toward Multimodal Image-to-Image Translation. NeurIPS 2017

Week 8:

  1. Sigurdsson et al., Asynchronous Temporal Fields for Action Recognition. CVPR 2017
  2. Dai et al., Detecting Visual Relationships with Deep Relational Networks. CVPR 2017
  3. Wu and Goodman, Multimodal Generative Models for Scalable Weakly-Supervised Learning. NeurIPS 2018
  4. Zhu et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017

Week 9:

  1. Lee et al., Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks. ICRA 2019
  2. Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019
  3. Das et al., Neural Modular Control for Embodied Question Answering. CoRL 2018
  4. Dai et al., Towards Diverse and Natural Image Descriptions via a Conditional GAN. ICCV 2017

Week 10:

  1. Pang and Wang, Guessing State Tracking for Visual Dialogue. ECCV 2020
  2. Hu et al., Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. CVPR 2020
  3. Hudson and Manning, Learning by Abstraction: The Neural State Machine. NeurIPS 2019
  4. Hill et al., Grounded Language Learning Fast and Slow. arXiv 2020

Week 12:

  1. Anderson et al., Sim-to-Real Transfer for Vision-and-Language Navigation. CoRL 2020
  2. Blukis et al., Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction. CoRL 2018
  3. Kojima et al., What is Learned in Visually Grounded Neural Syntax Acquisition. ACL 2020
  4. Zhu et al., The Return of Lexical Dependencies: Neural Lexicalized PCFGs. TACL 2020

Week 14:

  1. Alikhani et al., Clue: Cross-modal Coherence Modeling for Caption Generation. ACL 2020
  2. Agarwal et al., History for Visual Dialog: Do we really need it? ACL 2020
  3. Barocas and Selbst, Big Data’s Disparate Impact. California Law Review 2016
  4. Hovy and Spruit, The Social Impact of Natural Language Processing. ACL 2016