** Exact topics and schedule subject to change, based on student interests and course discussions. **

Date Topics Readings
1/16 Week 1 Course Introduction [slides]
  • Course syllabus and requirements
  • Multimodal principles: heterogeneity, connections, and interactions
  • Multimodal technical challenges
  • Multimodal research problems
1/23 Week 2 Foundation Part1: Dimensions of Heterogeneity [synopsis]
  • What is a taxonomy of the dimensions in which modalities can be heterogeneous? What are intuitive definitions of each dimension of heterogeneity?
  • Heterogeneity is also often seen in several other ML subfields (e.g., domain adaptation, domain shift, transfer learning, multitask learning, federated learning). What are some similarities and differences between the notions of heterogeneity in MMML and in these fields? Can definitions or methods from each area be adapted to benefit the other research areas?
  • How can we formalize these dimensions of heterogeneity, and subsequently estimate these measures to quantify the degree to which modalities are different? (One possible estimator is sketched at the end of this week's list.)
  • Heterogeneity in noise (e.g., due to sensor and system failures) is a relatively understudied dimension. How can we reliably understand the unique noise topologies of different modalities in order to design more robust models?
  • Modality heterogeneity often implies the design of specialized models capturing the unique properties of each modality. What are some tradeoffs between modality-specific and modality-general models?
  • Within each of the six multimodal challenges (representation, alignment, reasoning, generation, transference, and quantification), how can the study of heterogeneity inform various modeling decisions? What problems could happen in practice if heterogeneity is not properly understood or modeled?
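  One way to make the quantification question above concrete is to compare the embedding distributions of two modalities with a distributional distance. Below is a minimal sketch using maximum mean discrepancy (MMD) with an RBF kernel; the random "image" and "text" embeddings, their dimensions, and the kernel bandwidth are illustrative assumptions rather than a prescribed estimator from the readings.

      # Minimal sketch: one notion of modality heterogeneity, measured as the
      # maximum mean discrepancy (MMD) between embedding distributions.
      # The embeddings below are random placeholders for illustration only.
      import numpy as np

      def rbf_kernel(x, y, bandwidth=1.0):
          # Pairwise RBF kernel matrix between rows of x and rows of y.
          sq = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2 * x @ y.T
          return np.exp(-sq / (2 * bandwidth ** 2))

      def mmd2(x, y, bandwidth=1.0):
          # Biased estimate of the squared MMD between samples x and y.
          return (rbf_kernel(x, x, bandwidth).mean()
                  + rbf_kernel(y, y, bandwidth).mean()
                  - 2 * rbf_kernel(x, y, bandwidth).mean())

      rng = np.random.default_rng(0)
      image_emb = rng.normal(0.0, 1.0, size=(300, 64))  # hypothetical image embeddings
      text_emb = rng.normal(0.5, 1.5, size=(300, 64))   # hypothetical text embeddings
      print(f"estimated squared MMD: {mmd2(image_emb, text_emb):.4f}")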
1/30 Week 3 Foundation Part2: Multimodal Connections [synopsis]
  • What are the reasons why modalities can be connected with each other? Come up with a taxonomy of the various dimensions, thinking along both statistical, data-driven dimensions and semantic, hypothesis- or knowledge-driven dimensions. How can we define estimators that accurately quantify the presence of each type of connection in a dataset? (One simple estimator is sketched at the end of this week's list.)
  • Are connections always strong and one-to-one? Reflect on what could make some cross-modal connections stronger or weaker, including many-to-many connections, ambiguity, noise, or adversarial attacks. How can we adapt our learning methods to account for these imperfections?
  • Given trained multimodal models, how can we understand or visualize the nature of connections captured by the model? What benchmarks should we design to probe the quality of learned connections?
  • How can we better learn connections that happen at a very fine-grained and compositional level? Are there new inductive biases we might need to build into vision-language connection models?
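  As a starting point for the estimator question above, the sketch below measures the strength of a purely statistical cross-modal connection as the top canonical correlation between paired features. The synthetic "image" and "text" features sharing a single latent factor are an assumption made only so the example runs end to end.

      # Minimal sketch: estimate the strength of a statistical cross-modal
      # connection as the top canonical correlation between paired features.
      import numpy as np
      from sklearn.cross_decomposition import CCA

      rng = np.random.default_rng(0)
      latent = rng.normal(size=(1000, 1))  # shared latent content behind both modalities
      image_feat = latent @ rng.normal(size=(1, 32)) + 0.5 * rng.normal(size=(1000, 32))
      text_feat = latent @ rng.normal(size=(1, 16)) + 0.5 * rng.normal(size=(1000, 16))

      cca = CCA(n_components=1)
      cca.fit(image_feat, text_feat)
      img_c, txt_c = cca.transform(image_feat, text_feat)
      strength = np.corrcoef(img_c[:, 0], txt_c[:, 0])[0, 1]
      print(f"top canonical correlation: {strength:.3f}")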
2/6 Week 4 Foundation Part3: Multimodal Interactions [synopsis]
  • What are the different ways in which modalities can interact with each other when used for prediction tasks? Think across both semantic and statistical perspectives. Can we formalize a taxonomy of such interactions that would enable us to compare and contrast them more precisely? In fact, should we even try to create such a taxonomy?
  • Can you think of ways modalities could interact with each other, even if there is no prediction task? How are modalities interacting during cross-modal translation? During multimodal generation?
  • Linking back to last week’s discussion, are there cases where modalities are connected but do not interact? Or interact but are not connected? Can we design formal experiments to test either hypothesis?
  • What mathematical or empirical frameworks can be used to formalize the meaning of interactions? How can we subsequently define estimators that accurately quantify the presence of each type of interaction in a dataset? (One crude empirical probe is sketched at the end of this week's list.)
  • Some definitions (from the semantic category) typically require human judgment to detect and quantify interactions. What are some opportunities and limitations of using human judgment to analyze interactions? Can we potentially design estimators to automate the human labeling process?
  • Can you think of ways to utilize large language models or other foundation models to enhance the learning process of multimodal interactions?
  • How can we utilize cognitive theory to design a framework for understanding and learning the interactions between the multiple modalities that human beings encounter every day?
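  One crude empirical probe for the estimator question above is to compare unimodal and joint predictors: if neither modality predicts the label alone but the combination does, the interaction is synergistic. The sketch below illustrates this on a synthetic XOR-style task; the task, the logistic-regression probes, and the hand-added cross term are simplifying assumptions, not a formal information-theoretic decomposition.

      # Minimal sketch: a crude probe for synergistic interactions, comparing
      # unimodal vs. joint predictors on a synthetic task where the label is
      # the XOR of two modality-specific bits.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      n = 4000
      mod_a = rng.integers(0, 2, size=(n, 1))  # "modality A" feature
      mod_b = rng.integers(0, 2, size=(n, 1))  # "modality B" feature
      label = (mod_a ^ mod_b).ravel()          # synergy: neither modality alone suffices

      def accuracy(features):
          x_tr, x_te, y_tr, y_te = train_test_split(features, label, random_state=0)
          return LogisticRegression().fit(x_tr, y_tr).score(x_te, y_te)

      joint = np.hstack([mod_a, mod_b, mod_a * mod_b])  # cross term lets the probe express XOR
      print(f"A alone: {accuracy(mod_a):.2f}")          # ~0.5 (chance)
      print(f"B alone: {accuracy(mod_b):.2f}")          # ~0.5 (chance)
      print(f"A and B: {accuracy(joint):.2f}")          # ~1.0; the gap suggests synergy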
2/13 Week 5 Multimodal LLMs Part1: Data, Pretraining, and Scaling Laws [synopsis]
  • What types of multimodal data noise are typically present in multimodal datasets, and how can they negatively impact the performance of a model during training? Can you provide examples of multimodal data points that might be considered noisy? Furthermore, how might we develop estimators capable of distinguishing between noisy and noise-free multimodal data pairs? If you had unlimited funding for data filtering and data cleaning, what would be the ideal way to clean a multimodal dataset?
  • Given the demonstrated effectiveness of high-quality pretraining data, as evidenced by projects like Mistral, imagine you have access to a large-scale, high-quality multimodal dataset for pre-training purposes. What types of generalization or additional capabilities might this enable the model to acquire compared to models trained on lower-quality data? Why do models trained with high-quality data obtain such abilities?
  • Considering the diversity of model architectures available for multimodal generation, which architecture would be most suitable for scaling general multimodal generation tasks? Moreover, which model architecture is best equipped to learn complex multimodal interactions effectively?
  • What are some pros and cons of treating data from all modalities equally (throwing them into a single large generative Transformer, after tokenizing the data)?
  • If you were leading a multimodal foundation model project equipped with extensive resources, including a skilled team and significant GPU capabilities, what multimodal architecture and types of multimodal data would you prioritize for an initial pilot study?
  • In exploring the scaling laws of multimodal models, different papers define their scaling-law formulas differently. Which factors should be incorporated into the scaling-law formula, and which among these do you believe is the most critical to consider? (One common parametric form is sketched at the end of this week's list.)
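  For the scaling-law question above, one commonly used parametric form writes loss as L(N, D) = E + A / N^alpha + B / D^beta, where N is parameter count and D is training tokens. The sketch below fits this form to synthetic measurements with scipy; the constants, the grid of runs, and the noise level are all made up for illustration, and other papers use different functional forms.

      # Minimal sketch: fit one common scaling-law form,
      #   L(N, D) = E + A / N**alpha + B / D**beta,
      # to hypothetical (parameters, tokens, loss) measurements.
      import numpy as np
      from scipy.optimize import curve_fit

      def scaling_law(x, E, A, alpha, B, beta):
          n_params, n_tokens = x
          return E + A / n_params ** alpha + B / n_tokens ** beta

      p_grid, t_grid = np.meshgrid([1e8, 3e8, 1e9, 3e9, 1e10],
                                   [2e9, 6e9, 2e10, 6e10, 2e11])
      n_params, n_tokens = p_grid.ravel(), t_grid.ravel()

      rng = np.random.default_rng(0)
      true_consts = (1.7, 400.0, 0.34, 410.0, 0.28)  # made-up "ground truth" constants
      loss = scaling_law((n_params, n_tokens), *true_consts) + rng.normal(0, 0.01, n_params.shape)

      popt, _ = curve_fit(scaling_law, (n_params, n_tokens), loss,
                          p0=(2.0, 100.0, 0.3, 100.0, 0.3), maxfev=20000)
      print("fitted (E, A, alpha, B, beta):", np.round(popt, 3))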
2/20 Week 6 Multimodal LLMs Part2: Fine-tuning, Instructing, Aligning, Model Merging [synopsis]
  • Ensuring the effectiveness of multimodal foundation models through high-quality instruction tuning is vital. A study detailed here introduces a strategy for selecting significant data specifically suited for enhancing instruction tuning for language models. A primary challenge in this approach is determining which data are most crucial for targeted instruction tuning. How can we accurately identify and select the most impactful data for enhancing instruction tuning in multimodal foundation models? Given the complexity of diverse and multimodal information, what strategies can ensure the effectiveness of instruction tuning data for specific tasks?
  • For model merging, mixture-of-experts-based models enable a new paradigm for utilizing multiple expert models for specific tasks; a promising method for using multiple models together is shown here. When it comes to multimodal tasks, how might we design a similar system that achieves human-level intelligence? What methodologies could enable the integration of various multimodal models to perform complex tasks, such as social interaction, effectively? (The simplest form of model merging is sketched at the end of this week's list.)
  • What is the intuition behind utilizing frozen large language models as the backbone for multimodal tasks? Which types of encoders would facilitate the integration of diverse information into a format understandable by LLMs? How do these LLMs process and interpret information from different modalities?
  • Considering the various methods available for LLM alignment, is aligning multimodal models perceived to be more challenging or easier? What factors contribute to the difficulty of multimodal alignment, and how might this relate to the previously discussed foundations of multimodal machine learning, such as interactions and connections?
  • What is a taxonomy of general AI alignment? Can we classify AI alignment approaches based on the goal of conducting alignment? Assuming the existence of an oracle alignment method, what behaviors would we expect from an aligned AI model? Please list some behaviors that an AI should exhibit following successful alignment.
  • What distinguishes AI alignment from AI personalization? When focusing on AI alignment and personalization, what are the key differences and considerations to keep in mind? Is personalization easier or harder to conduct than alignment?
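  As a concrete reference point for the model-merging discussion above, the simplest merging scheme is elementwise weight interpolation between fine-tuned checkpoints that share an architecture. The sketch below shows only that mechanic; the tiny models stand in for, say, two differently specialized fine-tunes, and the 0.5 mixing weight is an arbitrary choice.

      # Minimal sketch: merge two same-architecture checkpoints by elementwise
      # weight interpolation. The toy models stand in for fine-tuned experts.
      import torch
      import torch.nn as nn

      def make_model():
          return nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

      model_a, model_b = make_model(), make_model()  # pretend these were fine-tuned on different tasks
      merged = make_model()

      alpha = 0.5                                    # interpolation weight between the two experts
      state_a, state_b = model_a.state_dict(), model_b.state_dict()
      merged.load_state_dict({name: alpha * state_a[name] + (1 - alpha) * state_b[name]
                              for name in state_a})

      x = torch.randn(4, 128)
      print(merged(x).shape)                         # torch.Size([4, 10])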
2/27 Week 7 Multimodal LLMs Part3: Generative Models and LLMs [synopsis]
  • Connecting with the multimodal foundation topics discussed in the previous weeks, what types of multimodal interactions or connections are large-scale generative models learning to capture? How can we link multimodal interactions with generative AI architectures? How can mathematical theories of multimodal interactions inform the design of the next generation of generative AI architectures?
  • With the advancement of generative AI, distinguishing between AI-generated and human-created content is becoming increasingly challenging. Besides watermarking, which has its limitations, are there other effective methods to differentiate between AI-generated and human-created content across various modalities (text, audio, video, image)? Or is it becoming virtually impossible to make this distinction?
  • What is a taxonomy of the safety issues, social impacts, and ethical concerns associated with generative AI development? How should we update best practices to address these ethical concerns? Who should initiate and lead this dialogue? What steps can be taken to mitigate these ethical issues effectively? Imagine we have an oracle multimodal generative AI system that is used at a large scale: what types of data pollution would arise if most of its generated content were published on the Internet?
  • When assessing the quality of multimodal outputs from generative AI systems, which dimensions should be prioritized? Can we develop metrics that allow for large-scale evaluation while mitigating potential safety and ethical risks?
  • Diffusion models have shown remarkable performance in controllable text-to-image generation. Could you explain the intuition behind why diffusion models are effective, especially in comparison to other generative model families such as GANs, VAEs, and autoregressive LLMs? Some works claim that scaling up GANs can beat diffusion models (here), while others claim that language models are better than diffusion models for image generation (here). Which generative model family do you think is the most promising for multimodal generation? (A toy version of the diffusion training objective is sketched at the end of this week's list.)
  • For state-of-the-art video generation models like Sora, Yann LeCun argued here that Sora does not understand the real world and its corresponding physical rules. Do you agree with this view? Can the future development of generative AI systems truly incorporate real-world knowledge, or are they limited in this aspect? Is pursuing generative AI a viable path towards achieving Artificial General Intelligence (AGI)?
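  For the question above about why diffusion models work, the core training objective is compact enough to show directly: corrupt the data at a random noise level and train a network to predict the injected noise. The sketch below does this on a toy 2-D dataset with a tiny MLP and a linear noise schedule; all of these choices are simplifications for intuition, not a faithful reproduction of any particular paper's setup.

      # Minimal sketch of the denoising-diffusion training objective: add noise
      # to data at a random timestep and train a network to predict that noise.
      import torch
      import torch.nn as nn

      T = 100
      betas = torch.linspace(1e-4, 0.02, T)
      alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention schedule

      # The network predicts the injected noise from (noisy sample, timestep).
      net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)

      data = torch.randn(512, 2) * 0.3 + torch.tensor([2.0, -1.0])  # toy 2-D "dataset"

      for step in range(200):
          t = torch.randint(0, T, (data.shape[0],))
          eps = torch.randn_like(data)
          a_bar = alphas_bar[t].unsqueeze(1)
          noisy = a_bar.sqrt() * data + (1 - a_bar).sqrt() * eps   # forward (noising) process
          pred = net(torch.cat([noisy, t.unsqueeze(1) / T], dim=1))
          loss = ((pred - eps) ** 2).mean()                        # noise-prediction loss
          opt.zero_grad(); loss.backward(); opt.step()

      print(f"final denoising loss: {loss.item():.3f}")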
3/5 Week 8 No classes – Spring break
3/12 Week 9 Interaction Part1: Reasoning and Large Models [synopsis]
  • Currently, most multimodal reasoning models are limited to the vision and language domains. In the real world, however, multimodal reasoning exists more broadly and takes more diverse forms. Can you list a few more examples of multimodal reasoning tasks in daily life that rely on other modalities, and describe how symbolic or other specialized reasoning methods could be applied to them?
  • Can you create a taxonomy of all the potential symbolic systems, such as AMR graphs, knowledge graphs, and programs, that can be helpful for different types of multimodal reasoning tasks? What are their unique advantages and disadvantages?
  • Based on the findings here, are there other complex reasoning tasks, besides reverse logic problems, that you think current foundation models might not handle well? How can neural symbolic models be incorporated to help with those hard cases?
  • Besides the approach here, can you imagine other potential ways to uncover the reasoning capabilities of black-box models, such as large language models and other multimodal foundation models? How can one specifically uncover the cross-modal reasoning processes in such a black-box model?
  • To what extent do we need external knowledge when performing reasoning, specifically multimodal reasoning? What type of external knowledge is likely to be needed to succeed in multimodal reasoning?
  • What are the main advantages of reasoning-based approaches compared to the large-scale multimodal models discussed in previous lectures? What are the potential issues with reasoning? Can we perform reasoning on very large datasets? Will pre-training methods eventually learn reasoning processes similar to humans', or will we still need human and domain knowledge to some extent?
  • Are there unique technical challenges that arise when we apply neural symbolic methods to multimodal data? What are these unique challenges? How can we start studying them in future research?
3/19 Week 10 Interaction Part2: Embodiment and Planning [synopsis]
  • For what sorts of embodied tasks might code be a good representation? For what tasks would it be a poor one?
  • When generating embodied plans from natural language, several dimensions of difficulty are the ambiguity in the language, the difficulty of grounding language in the environment, and the difficulty of carrying out the plan in the environment. Consider one or two of the papers from this week – which of these dimensions (or others) do they mainly address?
  • When building an embodied AI, one key challenge is defining a simple and clear action space to ground language into. Given a particular task such as cooking or housekeeping, how should we design an appropriate action space that can be easily and accurately grounded? Provide a task and its corresponding action-space design for grounding (one illustrative example is sketched at the end of this week's list).
  • Robotics requires broader multisensory machine learning techniques beyond the well-studied vision and language modalities. What potential sensory modalities for embodied agent tasks are not well studied now? What specific embodied tasks require information from these sensory modalities to be completed?
  • Based on the release of Figure 01 (here), what are three main potential technical challenges for the next steps of embodied AI, and why? What are three potential applications, such as automated housekeeping, that are still not achievable for robotics today?
  • For embodied AI training, embodied data is widely considered a serious bottleneck, and there is a lot of work on synthesizing data from virtual or physical environments, such as here. What are the key challenges in synthesizing data for embodied tasks, and how can we make sure the synthesized data are high quality?
  • What challenges do social settings add beyond standard embodied/robotics tasks?
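  To make the action-space question above concrete, below is one possible (and deliberately tiny) grounded action space for a toy cooking task: a handful of parameterized, discrete actions over named objects that a language plan can be mapped onto. The task, the object list, and the verbs are illustrative choices, not drawn from any of the readings.

      # Minimal sketch: a small parameterized action space for a toy cooking task,
      # plus a grounded plan for "fry an egg" expressed in that space.
      from dataclasses import dataclass
      from enum import Enum
      from typing import Optional

      class Verb(Enum):
          PICK_UP = "pick_up"
          PLACE_ON = "place_on"
          TURN_ON = "turn_on"
          POUR_INTO = "pour_into"

      OBJECTS = {"pan", "stove", "egg", "bowl", "spatula"}

      @dataclass
      class Action:
          verb: Verb
          target: str                        # object acted upon
          destination: Optional[str] = None  # second argument for PLACE_ON / POUR_INTO

          def __post_init__(self):
              assert self.target in OBJECTS, f"unknown object: {self.target}"

      plan = [
          Action(Verb.PICK_UP, "pan"),
          Action(Verb.PLACE_ON, "pan", destination="stove"),
          Action(Verb.TURN_ON, "stove"),
          Action(Verb.POUR_INTO, "egg", destination="pan"),
      ]
      for a in plan:
          print(a)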
3/26 Week 11 Interaction Part3: Interaction with People [synopsis]
  • Humans can provide many different types of feedback to help models accomplish challenging tasks in NLP, robotics, and multimodal tasks (e.g., ranking, scoring, and instructing). What are other types of feedback that can be useful for model training? Can we create a taxonomy of feedback forms, and describe each of their pros and cons? When should we use each type of feedback? (A toy example of learning from ranking feedback is sketched at the end of this week's list.)
  • In NLP, there’s been a trend of replacing human annotations/feedback with large language models. What are some limitations of this approach? What tasks that are currently done by humans cannot be replaced by large foundation models? What abilities might models need to have to be able to fully replace human annotators?
  • One key aspect of computational pragmatics is how context makes language have meaning beyond what’s literally said. Give some examples of settings that involve multimodal context where the multimodality changes or enriches the literal meaning of the language.
  • Brainstorm some settings where it would be useful for models to adapt to the people they are interacting with. This adaptation could involve the people's language, preferences, and backgrounds. Are these settings within reach of current models? What techniques do you think will be useful to enable adaptation? Are there also societal concerns if these models understand too much about their users?
  • Pick a task that people carry out in pairs or teams, that involves some social or grounded interaction between the people (e.g., pair programming, advising a graduate student, assembling a piece of furniture). How close or far do you think our current AI approaches are from being able to collaborate with the people carrying out this task? What is a research agenda towards enabling human-AI collaboration?
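  For the feedback-taxonomy question at the top of this list, ranking feedback in particular has a standard recipe: fit a reward model with a Bradley-Terry pairwise loss over preferred vs. rejected responses. The sketch below shows only that loss on synthetic feature vectors; in practice the inputs would be embeddings of model outputs judged by human annotators.

      # Minimal sketch: learn a reward model from pairwise (ranking) feedback
      # with a Bradley-Terry loss. Features are synthetic stand-ins for responses.
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      reward_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
      opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

      # Synthetic preferences: the "chosen" response has a higher hidden score.
      w_true = torch.randn(32)
      feats = torch.randn(2000, 2, 32)                 # pairs of candidate responses
      chosen_first = (feats @ w_true)[:, 0] > (feats @ w_true)[:, 1]

      for step in range(300):
          r = reward_model(feats).squeeze(-1)          # predicted reward for each response
          margin = torch.where(chosen_first, r[:, 0] - r[:, 1], r[:, 1] - r[:, 0])
          loss = -nn.functional.logsigmoid(margin).mean()   # Bradley-Terry pairwise loss
          opt.zero_grad(); loss.backward(); opt.step()

      print(f"final preference loss: {loss.item():.3f}")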
4/2 Week 12 Ethics and Safety [synopsis]
  • What are some ways to assess the trustworthiness of LLMs? How does the problem become harder when these LLMs are multimodal in their inputs and outputs? How can our earlier discussions on multimodal interactions, reasoning, etc. give new insights into improving the trust and safety of multimodal LLMs?
  • When are multimodal models more robust to adversarial attacks? When are they more susceptible? Why do both cases occur, and how can this inform our design of robust multimodal systems? (The mechanics of one classic attack are sketched at the end of this week's list.)
  • What are the qualities we should consider when evaluating outputs from multimodal generative AI? What do you think is the best practice to evaluate these qualities? Can we efficiently evaluate these qualities, at scale?
  • What are the real-world ethical issues regarding multimodal models? How can we build a taxonomy of the main ethical concerns, so that we can systematically evaluate and combat them? What are some ethical concerns that you are worried about, but not already popularized in mainstream media?
  • How can we update our best practices to help address these ethical concerns? Who is best placed to start this dialogue: academic researchers, industry, policymakers, or others? How can we make significant changes in this direction of highlighting and mitigating ethical issues?
  • Facing a foundation model system, what types of attacks could you launch to make the system fail or perform worse? What is a taxonomy of the attacks a user can make? What types of safety issues are exposed by different types of attacks?
  • When discussing robustness, what should an ideal robust multimodal model be able to do? Comparing multimodal and unimodal models, which kind do you think is more robust? Briefly describe why you think one type is more robust than the other when facing a particular problem.
  • Jailbreaking of foundation models is a commonly discussed topic. What is the root cause of models being jailbreakable? What are the potential ways to avoid such attacks and build guardrails?
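  As a concrete anchor for the robustness questions above, the sketch below shows the mechanics of one classic white-box attack, the fast gradient sign method (FGSM): perturb the input in the direction of the sign of the loss gradient. The untrained toy classifier, random "images", and the 8/255 budget are assumptions for illustration only.

      # Minimal sketch of FGSM: perturb inputs along the sign of the loss gradient.
      import torch
      import torch.nn as nn

      model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy, untrained classifier
      loss_fn = nn.CrossEntropyLoss()

      x = torch.rand(8, 3, 32, 32, requires_grad=True)  # stand-in image batch
      y = torch.randint(0, 10, (8,))                    # stand-in labels

      loss = loss_fn(model(x), y)
      loss.backward()

      epsilon = 8 / 255                                 # perturbation budget
      x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

      print("clean loss:   ", loss_fn(model(x), y).item())
      print("attacked loss:", loss_fn(model(x_adv), y).item())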
4/9 Week 13 Efficiency and Privacy [synopsis]
  • Different papers have different definitions of efficiency, including memory, time, and space. Are there other notions of efficiency that you think current work is missing out on, especially as we build multimodal systems for the real world? How can we make progress on these new notions?
  • How can our study of multimodal connections/interactions help us design more efficient models? How should we balance careful and efficient model design from the start, versus training large models and compressing them as a post-hoc step?
  • How can we scale multimodal models to extremely long sequence lengths, such as over years of human experience? What new capabilities will this enable? How can we start creating benchmarks to make progress toward these capabilities?
  • There has been a lot of work on making language models and vision models more efficient. What ideas from this line of work can be translated to other modalities and other multimodal problems? What new domain expertise will we need to build efficient models for these other settings? (One basic efficiency technique is sketched at the end of this week's list.)
  • Most work on improving efficiency fixes the modalities and makes the models more efficient. Are there potential ideas for changing the modalities themselves so that they can be handled more efficiently (e.g., going from video to images or to wireless sensors)?
  • How can we formalize the balance between information, fidelity, efficiency, and privacy of different modalities, and how can we choose which ones to use for a given problem?
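  As one concrete instance of the efficiency techniques mentioned above, the sketch below applies symmetric int8 post-training quantization to a weight matrix and reports the memory saving and reconstruction error. Real systems add per-channel scales, calibration data, and fused kernels; this shows only the core arithmetic on a randomly generated matrix.

      # Minimal sketch: symmetric int8 post-training quantization of one weight matrix.
      import numpy as np

      rng = np.random.default_rng(0)
      weights = rng.normal(0, 0.05, size=(4096, 4096)).astype(np.float32)

      scale = np.abs(weights).max() / 127.0         # a single scale for the whole tensor
      q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
      dequant = q.astype(np.float32) * scale

      print(f"memory: {weights.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
      print(f"mean absolute quantization error: {np.abs(weights - dequant).mean():.6f}")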
4/16 Week 14 Open Discussion [synopsis]
4/23 Week 15 Project presentations