Week 2:

  1. Baltrusaitis et al., Multimodal Machine Learning: A Survey and Taxonomy. TPAMI 2018
  2. Bengio et al., Representation Learning: A Review and New Perspectives. TPAMI 2013

Week 3:

  1. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
  2. Selvaraju et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017
  3. Karpathy et al., Visualizing and Understanding Recurrent Networks. arXiv 2015
  4. Khandelwal et al., Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. ACL 2018

Week 4:

  1. Owens and Efros, Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. ECCV 2018
  2. Wang et al., Learning Deep Structure-Preserving Image-Text Embeddings. CVPR 2016
  3. Eisenschtat and Wolf, Linking Image and Text with 2-Way Nets. CVPR 2017
  4. Zhang et al., AE2-Nets: Autoencoder in Autoencoder Networks. CVPR 2019

Week 5:

  1. Anderson et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
  2. Wiegreffe and Pinter, Attention is not not Explanation. EMNLP 2019
  3. Le et al., Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems. ACL 2019
  4. Tan and Bansal, LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP 2019

Week 7:

  1. Mao et al., The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. ICLR 2019
  2. Kottur et al., Visual Coreference Resolution in Visual Dialog using Neural Module Networks. ECCV 2018
  3. Cuturi and Blondel, Soft-DTW: a Differentiable Loss Function for Time-Series. ICML 2017
  4. Zhu et al., Toward Multimodal Image-to-Image Translation. NeurIPS 2017

Week 8:

  1. Sigurdsson et al., Asynchronous Temporal Fields for Action Recognition. CVPR 2017
  2. Dai et al., Detecting Visual Relationships with Deep Relational Networks. CVPR 2017
  3. Wu and Goodman, Multimodal Generative Models for Scalable Weakly-Supervised Learning. NeurIPS 2018
  4. Zhu et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017

Week 9:

  1. Lee et al., Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks. ICRA 2019
  2. Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019
  3. Das et al., Neural Modular Control for Embodied Question Answering. CoRL 2018
  4. Dai et al., Towards Diverse and Natural Image Descriptions via a Conditional GAN. ICCV 2017

Week 10:

  1. Pang and Wang, Guessing State Tracking for Visual Dialogue. ECCV 2020
  2. Hu et al., Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. CVPR 2020
  3. Hudson and Manning, Learning by Abstraction: The Neural State Machine. NeurIPS 2019
  4. Hill et al., Grounded Language Learning Fast and Slow. arXiv 2020

Week 12:

  1. Anderson et al., Sim-to-Real Transfer for Vision-and-Language Navigation. CoRL 2020
  2. Blukis et al., Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction. CoRL 2018
  3. Kojima et al., What is Learned in Visually Grounded Neural Syntax Acquisition. ACL 2020
  4. Zhu et al., The Return of Lexical Dependencies: Neural Lexicalized PCFGs. TACL 2020

Week 14:

  1. Alikhani et al., Clue: Cross-modal Coherence Modeling for Caption Generation. ACL 2020
  2. Agarwal et al., History for Visual Dialog: Do we really need it? ACL 2020
  3. Barocas and Selbst, Big Data’s Disparate Impact. California Law Review 2016
  4. Hovy and Spruit, The Social Impact of Natural Language Processing. ACL 2016