The exact topics and schedule are subject to change based on student interests and course discussions.

1/20 | Week 1: Course Introduction [slides]
- Course syllabus and requirements
- Multimodal principles: heterogeneity, connections, and interactions
- Multimodal technical challenges

1/27 | Week 2: Dimensions of Heterogeneity [synopsis]
- What is a taxonomy of the dimensions in which modalities can be heterogeneous?
- Heterogeneity is also often seen in several other ML subfields (e.g., domain adaptation, domain shift, transfer learning, multitask learning, federated learning, etc.). What are some similarities and differences between the notions of heterogeneity in MMML and in these fields? Can definitions or methods in each area be adapted to benefit the other research areas?
- How can we formalize these dimensions of heterogeneity, and subsequently estimate these measures to quantify the degree to which modalities are different?
- Modality heterogeneity often motivates the design of specialized models that capture the unique properties of each modality. What are some tradeoffs between modality-specific and modality-general models?
- What other modeling considerations should ideally be informed by how heterogeneous the input modalities are?
- What are some risks of ignoring modality or task heterogeneity? What if we are unable to estimate modality or task heterogeneity accurately?

2/3 | Week 3: Modality Connections [synopsis]
- What are the reasons why modalities can be connected with each other? Come up with a taxonomy of the various dimensions. Think along both statistical, data-driven dimensions and semantic, hypothesis- or knowledge-driven dimensions. What are the pros and cons of either approach for understanding modality connections?
- Are connections always strong and one-to-one? Reflect on what could make some cross-modal connections stronger or weaker, including many-to-many connections, ambiguity, or noise.
- Given trained multimodal models, how can we understand or visualize the nature of the connections captured by the model?
- What formalism or framework could be used to formalize cross-modal connections? How can we subsequently define estimators that accurately quantify the presence of each type of connection given a dataset? (A simplified starting point is sketched after this list.) How much knowledge of each modality do we need in order to estimate modality connections?
- Linking back to last week's discussion on heterogeneity, how would you relate the concepts of heterogeneity and connections? How does heterogeneity affect the study of cross-modal connections, and inversely, how should connections be taken into consideration when heterogeneity is studied? Are connections also present in homogeneous settings?

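As one concrete but deliberately simplified starting point for the estimator question above, the sketch below treats the statistical connection between two modalities as the mutual information between (discretized) one-dimensional features of each modality. The synthetic data, the reduction to a single scalar feature per modality, and the bin count are illustrative assumptions rather than a prescribed method.

```python
# Minimal sketch (illustrative assumptions throughout): quantify the
# statistical connection between two modalities as the mutual information
# between one scalar feature per modality, estimated from a 2-D histogram.
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate (in nats) between two 1-D variables."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)          # marginal of x
    py = pxy.sum(axis=0, keepdims=True)          # marginal of y
    nz = pxy > 0                                 # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                           # shared latent "content"
mod_a = z + 0.3 * rng.normal(size=n)             # e.g., an image feature
mod_b = 0.5 * z + 0.3 * rng.normal(size=n)       # e.g., a caption feature
mod_c = rng.normal(size=n)                       # an unrelated modality

print("connected pair   MI:", mutual_information(mod_a, mod_b))
print("unconnected pair MI:", mutual_information(mod_a, mod_c))
```

Even this toy version surfaces the discussion points above: the estimate depends on how each modality is reduced to features and how many samples are available, and it only captures statistical association, not the semantic reasons why the modalities are connected.
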
2/10 | Week 4: Modality Interactions [synopsis]
- What are the different ways in which modalities can interact with each other when used for a prediction task? Think across both semantic and statistical perspectives. Can we formalize a taxonomy of such interactions, which would enable us to compare and contrast them more precisely? In fact, should we even try creating such a taxonomy?
- Can you think of ways modalities could interact with each other even if there is no prediction task? How do modalities interact during cross-modal translation? During multimodal generation?
- Linking back to last week's discussion, are there cases where modalities are connected but do not interact? Or interact but are not connected? Can we design formal experiments to test either hypothesis? (A toy experiment along these lines is sketched after this list.)
- What mathematical or empirical frameworks can be used to formalize the meaning of interactions? How can we subsequently define estimators that accurately quantify the presence of each type of interaction given a dataset?
- Some definitions (from the semantic category) typically require human judgment to detect and quantify interactions. What are some opportunities and limitations of using human judgment to analyze interactions? Can we potentially design estimators to automate the human labeling process?
- What are the design decisions (i.e., inductive biases) that can be used when modeling each type of interaction in machine learning models?
- What are the advantages and drawbacks of designing models to capture each type of cross-modal interaction? Consider not just prediction performance, but also tradeoffs in time/space complexity, interpretability, etc.

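One way to make the "connected but not interacting" versus "interacting but not connected" question above concrete is a small synthetic experiment comparing unimodal and joint predictors in two regimes. The data-generating choices and the decision-tree probe below are illustrative assumptions, not a prescribed protocol.

```python
# Toy experiment (assumed setup): regime A has connected modalities (x2 is a
# noisy copy of x1) but the label depends on x1 alone; regime B has independent
# modalities whose XOR determines the label, so only their interaction helps.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 4000

def cv_acc(X, y):
    """Cross-validated accuracy of a small decision tree used as a probe."""
    return cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5).mean()

# Regime A: connected modalities, no cross-modal interaction needed for the task.
x1 = rng.integers(0, 2, n)
x2 = np.where(rng.random(n) < 0.9, x1, 1 - x1)   # noisy copy of x1
y = x1
print("A: x1 %.2f | x2 %.2f | both %.2f" % (
    cv_acc(x1[:, None], y), cv_acc(x2[:, None], y), cv_acc(np.stack([x1, x2], 1), y)))

# Regime B: unconnected modalities, the label requires their interaction (XOR).
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2
print("B: x1 %.2f | x2 %.2f | both %.2f" % (
    cv_acc(x1[:, None], y), cv_acc(x2[:, None], y), cv_acc(np.stack([x1, x2], 1), y)))
```

In regime A the joint probe adds essentially nothing over the better unimodal probe, while in regime B both unimodal probes sit near chance and only the joint probe succeeds; real data is far messier, but the same compare-unimodal-versus-joint logic underlies many interaction analyses.
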
2/17 | Week 5: Modality Utility, Tradeoffs, and Selection [synopsis]
- What are the different ways in which modalities can be useful for a task? How can we measure the utility of a modality for a task, given only access to the dataset (i.e., before designing and training a model)? (A toy probe-based sketch follows this list.)
- What are the criteria by which we should add or select modalities for a task? Is minimizing redundancy the only goal? Are there benefits to maximizing redundancy? Are there other criteria we should consider too?
- Is modality selection the same as feature selection? What potential differences and new technical challenges arise in modality selection that are not present in conventional feature selection?
- Given trained models, how can we estimate how important each modality was in making the prediction? How were these modalities used separately and in interaction with other modalities?
- What are the different ways in which modalities can be harmful for a task? Think about a list of reasons why we would prefer not to use a modality. How can we quantify these potential risks?
- What are some solutions for tackling these risks and biases in multimodal datasets and models? How can we properly identify, visualize, and eventually reduce these risks in multimodal datasets and models?
- Can we come up with guidelines that compare the tradeoffs between modality benefits and risks? How can we then integrate these insights into multimodal model design? Will integrating these insights help?

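For the utility and selection questions above, one deliberately simple baseline is to probe each modality (and each candidate combination) with a cheap cross-validated classifier and add modalities greedily by marginal gain. The toy data, the logistic-regression probe, and the greedy criterion are assumptions for illustration only; they ignore the interaction, cost, and bias considerations raised in the other questions.

```python
# Minimal sketch (illustrative only): probe-based modality utility and greedy
# modality selection on synthetic data with one informative, one weakly
# informative, and one uninformative modality.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
modalities = {                                   # toy stand-ins for real features
    "audio": y[:, None] + rng.normal(scale=1.0, size=(n, 4)),
    "video": y[:, None] + rng.normal(scale=2.0, size=(n, 4)),
    "text":  rng.normal(size=(n, 4)),            # carries no label information
}

def probe(feature_blocks):
    """Cross-validated accuracy of a linear probe on concatenated modalities."""
    X = np.hstack(feature_blocks)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

selected = []
remaining = dict(modalities)
while remaining:
    best = max(remaining, key=lambda name: probe(
        [modalities[s] for s in selected] + [remaining[name]]))
    score = probe([modalities[s] for s in selected] + [remaining[best]])
    print("add %-5s -> cv accuracy %.2f" % (best, score))
    selected.append(best)
    remaining.pop(best)
```

A limitation worth discussing in class: a greedy, probe-based criterion can undervalue a modality that contributes only through interactions with modalities not yet selected (as in the XOR regime from Week 4), which is one reason modality selection is not simply feature selection.
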
2/24 | Week 6: Pretraining and Scaling [synopsis]
- Is large-scale pretraining the way forward for building general AI models? What information potentially cannot be captured by pretraining? What are some potential risks of pretraining and scenarios where we should not use pretrained models?
- How can we, in an academic environment, do impactful research in multimodal pretraining? What would be your proposed multi-year research agenda in this topic?
- What are the types of cross-modal interactions that are likely to be modeled by current pretrained models? What cross-modal interactions will be harder to model with these methods? Do you have proposals for different pretraining data, architectures, or objectives that can better capture these interactions?
- How can we best integrate multimodality into pretrained language models? What kind of additional data and modeling/optimization decisions do we need?
- What are the different design decisions when integrating multimodal information in pretraining models and objectives? What are the main advantages and drawbacks of these design choices? Consider not just prediction performance, but tradeoffs in time/space complexity, interpretability, and so on.
- How can we evaluate the type of multimodal information learned in pretrained models? One approach is to look at downstream tasks, but what are other ways to uncover the knowledge stored in pretrained models?

3/3 | Week 7: Multimodal Reasoning [synopsis]
- What is reasoning, and what are its subchallenges? What could be a taxonomy of the main processes involved in reasoning? What are some potential formal definitions of reasoning?
- Are there unique technical challenges that arise because reasoning is performed on multimodal versus unimodal data? How can we start studying these challenges in future research? Try to link them back to our previous definitions of heterogeneity, connections, and interactions when thinking about multimodal reasoning challenges.
- What are the main advantages of reasoning-based approaches, when compared to the large-scale pre-training methods discussed last week? What are the potential issues with reasoning-based methods? Can we come up with a research agenda that combines the best of both worlds?
- Can we perform reasoning on very large datasets? Can pre-training methods eventually learn reasoning processes similar to humans? Or will we still need human and domain knowledge to some extent?
- What are some ways to uncover the reasoning capabilities of multimodal models? What additional techniques do we need beyond those used to measure the reasoning of unimodal models?
- To what extent do we need external knowledge when performing reasoning, specifically multimodal reasoning? What type of external knowledge is likely to be needed to succeed at multimodal reasoning?

3/10 | Week 8: No classes – Spring break

3/17 | Week 9: Multimodal Co-learning [synopsis]
- We define co-learning broadly as multimodal data and training helping performance on unimodal tasks. Under what scenarios will co-learning occur? Why is it that research has demonstrated both positive and negative results? What assumptions do we have to make about the heterogeneity of data sources and the nature of connections and interactions between modalities for co-learning to be successful?
- How can we formally, empirically, or intuitively measure the additional information provided by auxiliary modalities for co-learning? How can we design controlled experiments to test these hypotheses?
- What are some design decisions (modeling, training, objective functions) that could be made to promote co-learning from one modality to another? What is a taxonomy of approaches, and their pros and cons?
- Text is usually the modality used for additional supervision. Why is text such a popular choice? Can other modalities also be used for additional supervision, and how would co-learning methods work differently?
- How do we measure what information is transferred during co-learning? How do we ensure that only useful information is transferred, and not some undesirable bias or shortcut?
- How can we know if co-learning has succeeded or failed? What approaches could we develop to visualize and probe the success of co-learning, beyond target task performance?
- What are the advantages and drawbacks of information transfer during co-learning? Consider not just prediction performance, but also tradeoffs with increased complexity, interpretability, biases, etc. Can we come up with a guideline for when we should use co-learning, i.e., when the benefits outweigh the additional costs?

3/24 | Week 10: Multimodal Generation and Ethics [synopsis]
- What qualities should we consider when evaluating outputs from multimodal generation? What do you think is the best practice for evaluating these qualities? Can we efficiently evaluate these qualities at scale?
- What are some challenges in multimodal generation beyond generating each modality individually? How can we synchronize generation across multiple modalities?
- What aspects of multimodal learning are prerequisites for generation to be possible? For example, how much do models need to learn regarding heterogeneity, connections, and interactions?
- There have been many directions toward conditional generation without fully paired data, or with paired data at coarser granularities (e.g., text-video generation using only text-image data). What is a taxonomy of weak supervision approaches for generation? How do we know what type of data is necessary for accurate generation?
- What are the opportunities and challenges of automatic and human evaluation? How can we combine the best of both worlds?
- What are the real-world ethical issues regarding generation? How are these risks potentially amplified or reduced when the dataset is multimodal, with heterogeneous modalities? Are there any ethical issues that are specific to multimodal generation?
- How can we build a taxonomy of the main ethical concerns related to multimodal generation?
- How can we update our best practices to help address these ethical concerns? Who is best placed to start this dialogue? How can we make significant changes in this direction of reducing ethical issues?

3/31 | Week 11: Multimodal LMs and the Future [synopsis]
- Think of multimodal research problems and technical challenges that are becoming more important and are possibly enabled by GPT-like models. Of these, which problems is academia particularly well suited to work on?
- Think of multimodal research problems and technical challenges that have become less relevant, or may even have been solved, by GPT-like models.
- At a high level (1-minute elevator pitch), describe one specific multimodal research project that you could embark on, enabled by GPT-like models. Describe the key research questions, technical challenges, evaluation criteria, and broader impact.
- At a high level (1-minute elevator pitch), describe one real-world product idea enabled by GPT-like models. Prepare a sales pitch: current shortcomings, motivation, broad impact, potential technical challenges, any real-world deployment issues you could face, and how you would evaluate success and impact. Who are the stakeholders who might use the product? How do you think this product will help them?
- What could the future of pretrained models look like? More modalities, more generative capabilities, more personalization, more efficiency, or something else? Which fundamental multimodal technical challenges will arise as more multimodal pretrained models are created?
- How can we, in an academic environment, do impactful research in multimodal ML given the success of these pretrained models? What would be your proposed 10-year research agenda in multimodal ML, assuming you had access to funding and researchers?

4/7 | Week 12: Brain and Multimodal Perception [synopsis]
- What are the main takeaways from neuroscience regarding unimodal and multimodal processing, integration, translation, and co-learning that are less known in conventional multimodal ML?
- How can these insights inform our design of multimodal models, following the challenges we covered previously (connections, interactions, co-learning, pre-training, reasoning, etc.)?
- To what extent should we design AI models with the explicit goal of mirroring human perception and reasoning, versus relying on large-scale pre-training methods and general neural networks?
- How does the human brain represent different modalities (visual, acoustic, touch)? Are these different modalities represented in very heterogeneous ways? How is information linked between modalities?
- What are several challenges and opportunities in multimodal learning from brain imaging modalities? How do these modalities introduce new challenges not seen in conventional language and vision research?
- What are some ways in which multimodal learning can help in the future analysis of data collected in neuroscience? What unique challenges arise in this new research direction, beyond classical multimodal learning?

4/14 | Week 13: No classes – CMU Carnival

4/21 | Week 14: Open Discussion

4/28 | Week 15: Project presentations