** Exact topics and schedule subject to change, based on student interests and course discussions. **

Date Topics Readings
1/16 Week 1 Course Introduction [slides]
  • Course syllabus and requirements
  • Multimodal principles: heterogeneity, connections, and interactions
  • Multimodal technical challenges
  • Multimodal research problems
1/23 Week 2 Foundation Part1: Dimensions of Heterogeneity [synopsis]
  • What is a taxonomy of the dimensions in which modalities can be heterogeneous? What are intuitive definitions of each dimension of heterogeneity?
  • Heterogeneity is also often seen in several other ML subfields (e.g., domain adaptation, domain shift, transfer learning, multitask learning, federated learning). What are some similarities and differences between the notions of heterogeneity in MMML and in these fields? Can definitions or methods from each area be adapted to benefit the other research areas?
  • How can we formalize these dimensions of heterogeneity, and subsequently estimate these measures to quantify the degree to which modalities are different? (One possible estimator is sketched at the end of this week's list.)
  • Heterogeneity in noise (e.g., due to sensor and system failures) is a relatively understudied dimension. How can we reliably understand the unique noise topologies of different modalities in order to design more robust models?
  • Modality heterogeneity often implies the design of specialized models capturing the unique properties of each modality. What are some tradeoffs between modality-specific and modality-general models?
  • Within each of the six multimodal challenges (representation, alignment, reasoning, generation, transference, and quantification), how can the study of heterogeneity inform various modeling decisions? What problems could happen in practice if heterogeneity is not properly understood or modeled?
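  One way to make the quantification question above concrete is to compare the embedding distributions of two modalities with a distributional distance. Below is a minimal sketch using maximum mean discrepancy (MMD) with an RBF kernel; the random "image" and "text" embeddings, their dimensions, and the kernel bandwidth are illustrative assumptions rather than a prescribed estimator from the readings.

      # Minimal sketch: one notion of modality heterogeneity, measured as the
      # maximum mean discrepancy (MMD) between embedding distributions.
      # The embeddings below are random placeholders for illustration only.
      import numpy as np

      def rbf_kernel(x, y, bandwidth=1.0):
          # Pairwise RBF kernel matrix between rows of x and rows of y.
          sq = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2 * x @ y.T
          return np.exp(-sq / (2 * bandwidth ** 2))

      def mmd2(x, y, bandwidth=1.0):
          # Biased estimate of the squared MMD between samples x and y.
          return (rbf_kernel(x, x, bandwidth).mean()
                  + rbf_kernel(y, y, bandwidth).mean()
                  - 2 * rbf_kernel(x, y, bandwidth).mean())

      rng = np.random.default_rng(0)
      image_emb = rng.normal(0.0, 1.0, size=(300, 64))  # hypothetical image embeddings
      text_emb = rng.normal(0.5, 1.5, size=(300, 64))   # hypothetical text embeddings
      print(f"estimated squared MMD: {mmd2(image_emb, text_emb):.4f}")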
1/30 Week 3 Foundation Part2: Multimodal Connections [synopsis]
  • What are the reasons why modalities can be connected with each other? Come up with a taxonomy of the various dimensions, thinking along both statistical, data-driven dimensions and semantic, hypothesis- or knowledge-driven dimensions. How can we define estimators that accurately quantify the presence of each type of connection in a dataset? (One simple estimator is sketched at the end of this week's list.)
  • Are connections always strong and one-to-one? Reflect on what could make some cross-modal connections stronger or weaker, including many-to-many connections, ambiguity, noise, or adversarial attacks. How can we adapt our learning methods to account for these imperfections?
  • Given trained multimodal models, how can we understand or visualize the nature of connections captured by the model? What benchmarks should we design to probe the quality of learned connections?
  • How can we better learn connections that happen at a very fine-grained and compositional level? Are there new inductive biases we might need to build into vision-language connection models?
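  As a starting point for the estimator question above, the sketch below measures the strength of a purely statistical cross-modal connection as the top canonical correlation between paired features. The synthetic "image" and "text" features sharing a single latent factor are an assumption made only so the example runs end to end.

      # Minimal sketch: estimate the strength of a statistical cross-modal
      # connection as the top canonical correlation between paired features.
      import numpy as np
      from sklearn.cross_decomposition import CCA

      rng = np.random.default_rng(0)
      latent = rng.normal(size=(1000, 1))  # shared latent content behind both modalities
      image_feat = latent @ rng.normal(size=(1, 32)) + 0.5 * rng.normal(size=(1000, 32))
      text_feat = latent @ rng.normal(size=(1, 16)) + 0.5 * rng.normal(size=(1000, 16))

      cca = CCA(n_components=1)
      cca.fit(image_feat, text_feat)
      img_c, txt_c = cca.transform(image_feat, text_feat)
      strength = np.corrcoef(img_c[:, 0], txt_c[:, 0])[0, 1]
      print(f"top canonical correlation: {strength:.3f}")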
2/6 Week 4 Foundation Part3: Multimodal Interactions [synopsis]
  • What are the different ways in which modalities can interact with each other when used for prediction tasks? Think across both semantic and statistical perspectives. Can we formalize a taxonomy of such interactions that would enable us to compare and contrast them more precisely? In fact, should we even try to create such a taxonomy?
  • Can you think of ways modalities could interact with each other, even if there is no prediction task? How are modalities interacting during cross-modal translation? During multimodal generation?
  • Linking back to last week’s discussion, are there cases where modalities are connected but do not interact? Or interact but are not connected? Can we design formal experiments to test either hypothesis?
  • What mathematical or empirical frameworks can be used to formalize the meaning of interactions? How can we subsequently define estimators that accurately quantify the presence of each type of interaction in a dataset? (One crude empirical probe is sketched at the end of this week's list.)
  • Some definitions (from the semantic category) typically require human judgment to detect and quantify interactions. What are some opportunities and limitations of using human judgment to analyze interactions? Can we potentially design estimators to automate the human labeling process?
  • Can you think of ways to utilize large language models or other foundation models to enhance the learning process of multimodal interactions?
  • How can we utilize cognitive theory to design a framework for understanding and learning the interactions between the multiple modalities that human beings encounter every day?
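  One crude empirical probe for the estimator question above is to compare unimodal and joint predictors: if neither modality predicts the label alone but the combination does, the interaction is synergistic. The sketch below illustrates this on a synthetic XOR-style task; the task, the logistic-regression probes, and the hand-added cross term are simplifying assumptions, not a formal information-theoretic decomposition.

      # Minimal sketch: a crude probe for synergistic interactions, comparing
      # unimodal vs. joint predictors on a synthetic task where the label is
      # the XOR of two modality-specific bits.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      n = 4000
      mod_a = rng.integers(0, 2, size=(n, 1))  # "modality A" feature
      mod_b = rng.integers(0, 2, size=(n, 1))  # "modality B" feature
      label = (mod_a ^ mod_b).ravel()          # synergy: neither modality alone suffices

      def accuracy(features):
          x_tr, x_te, y_tr, y_te = train_test_split(features, label, random_state=0)
          return LogisticRegression().fit(x_tr, y_tr).score(x_te, y_te)

      joint = np.hstack([mod_a, mod_b, mod_a * mod_b])  # cross term lets the probe express XOR
      print(f"A alone: {accuracy(mod_a):.2f}")          # ~0.5 (chance)
      print(f"B alone: {accuracy(mod_b):.2f}")          # ~0.5 (chance)
      print(f"A and B: {accuracy(joint):.2f}")          # ~1.0; the gap suggests synergy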
2/13 Week 5 Multimodal LLMs Part1: Data, Pretraining, and Scaling Laws [synopsis]
  • What types of multimodal data noise are typically present in multimodal datasets, and how can they negatively impact the performance of a model during training? Can you provide examples of multimodal data points that might be considered noisy? Furthermore, how might we develop estimators capable of distinguishing between noisy and noise-free multimodal data pairs? If you had unlimited funding for data filtering and data cleaning, what would be the ideal way to clean a multimodal dataset?
  • Given the demonstrated effectiveness of high-quality pretraining data, as evidenced by projects like Mistral, imagine you have access to a large-scale, high-quality multimodal dataset for pre-training purposes. What types of generalization or additional capabilities might this enable the model to acquire compared to models trained on lower-quality data? Why do models trained with high-quality data obtain such abilities?
  • Considering the diversity of model architectures available for multimodal generation, which architecture would be most suitable for scaling general multimodal generation tasks? Moreover, which model architecture is best equipped to learn complex multimodal interactions effectively?
  • What are some pros and cons of treating data from all modalities equally (throwing them into a single large generative Transformer, after tokenizing the data)?
  • If you were leading a multimodal foundation model project equipped with extensive resources, including a skilled team and significant GPU capabilities, what multimodal architecture and types of multimodal data would you prioritize for an initial pilot study?
  • In exploring the scaling laws of multimodal models, different papers define their scaling-law formulas differently. Which factors should be incorporated into the scaling-law formula, and which among these do you believe is the most critical to consider? (One common parametric form is sketched at the end of this week's list.)
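  For the scaling-law question above, one commonly used parametric form writes loss as L(N, D) = E + A / N^alpha + B / D^beta, where N is parameter count and D is training tokens. The sketch below fits this form to synthetic measurements with scipy; the constants, the grid of runs, and the noise level are all made up for illustration, and other papers use different functional forms.

      # Minimal sketch: fit one common scaling-law form,
      #   L(N, D) = E + A / N**alpha + B / D**beta,
      # to hypothetical (parameters, tokens, loss) measurements.
      import numpy as np
      from scipy.optimize import curve_fit

      def scaling_law(x, E, A, alpha, B, beta):
          n_params, n_tokens = x
          return E + A / n_params ** alpha + B / n_tokens ** beta

      p_grid, t_grid = np.meshgrid([1e8, 3e8, 1e9, 3e9, 1e10],
                                   [2e9, 6e9, 2e10, 6e10, 2e11])
      n_params, n_tokens = p_grid.ravel(), t_grid.ravel()

      rng = np.random.default_rng(0)
      true_consts = (1.7, 400.0, 0.34, 410.0, 0.28)  # made-up "ground truth" constants
      loss = scaling_law((n_params, n_tokens), *true_consts) + rng.normal(0, 0.01, n_params.shape)

      popt, _ = curve_fit(scaling_law, (n_params, n_tokens), loss,
                          p0=(2.0, 100.0, 0.3, 100.0, 0.3), maxfev=20000)
      print("fitted (E, A, alpha, B, beta):", np.round(popt, 3))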
2/20 Week 6 Multimodal LLMs Part2: Fine-tuning, Instructing, Aligning, Model Merging [synopsis]
  • Ensuring the effectiveness of multimodal foundation models through high-quality instruction tuning is vital. A study detailed here introduces a strategy for selecting significant data specifically suited for enhancing instruction tuning for language models. A primary challenge in this approach is determining which data are most crucial for targeted instruction tuning. How can we accurately identify and select the most impactful data for enhancing instruction tuning in multimodal foundation models? Given the complexity of diverse and multimodal information, what strategies can ensure the effectiveness of instruction tuning data for specific tasks?
  • For model merging, mixture-of-experts-based models enable a new paradigm for utilizing multiple expert models for specific tasks; a promising method for using multiple models together is shown here. When it comes to multimodal tasks, how might we design a similar system that achieves human-level intelligence? What methodologies could enable the integration of various multimodal models to perform complex tasks, such as social interaction, effectively? (The simplest form of model merging is sketched at the end of this week's list.)
  • What is the intuition behind utilizing frozen large language models as the backbone for multimodal tasks? Which types of encoders would facilitate the integration of diverse information into a format understandable by LLMs? How do these LLMs process and interpret information from different modalities?
  • Considering the various methods available for LLM alignment, is aligning multimodal models perceived to be more challenging or easier? What factors contribute to the difficulty of multimodal alignment, and how might this relate to the previously discussed foundations of multimodal machine learning, such as interactions and connections?
  • What is a taxonomy of general AI alignment? Can we classify AI alignment approaches based on the goal of conducting alignment? Assuming the existence of an oracle alignment method, what behaviors would we expect from an aligned AI model? Please list some behaviors that an AI should exhibit following successful alignment.
  • What distinguishes AI alignment from AI personalization? When focusing on AI alignment and personalization, what are the key differences and considerations to keep in mind? Is personalization easier or harder to conduct than alignment?
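  As a concrete reference point for the model-merging discussion above, the simplest merging scheme is elementwise weight interpolation between fine-tuned checkpoints that share an architecture. The sketch below shows only that mechanic; the tiny models stand in for, say, two differently specialized fine-tunes, and the 0.5 mixing weight is an arbitrary choice.

      # Minimal sketch: merge two same-architecture checkpoints by elementwise
      # weight interpolation. The toy models stand in for fine-tuned experts.
      import torch
      import torch.nn as nn

      def make_model():
          return nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

      model_a, model_b = make_model(), make_model()  # pretend these were fine-tuned on different tasks
      merged = make_model()

      alpha = 0.5                                    # interpolation weight between the two experts
      state_a, state_b = model_a.state_dict(), model_b.state_dict()
      merged.load_state_dict({name: alpha * state_a[name] + (1 - alpha) * state_b[name]
                              for name in state_a})

      x = torch.randn(4, 128)
      print(merged(x).shape)                         # torch.Size([4, 10])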
2/27 Week 7 Multimodal LLMs Part3: Generative Models and LLMs [synopsis]
  • Connecting with the multimodal foundation topics discussed in the previous weeks, what types of multimodal interactions or connections are large-scale generative models learning to capture? How can we link multimodal interactions with generative AI architectures? How can mathematical theories of multimodal interactions inform the design of the next generation of generative AI architectures?
  • With the advancement of generative AI, distinguishing between AI-generated and human-created content is becoming increasingly challenging. Besides watermarking, which has its limitations, are there other effective methods to differentiate between AI-generated and human-created content across various modalities (text, audio, video, image)? Or is it becoming virtually impossible to make this distinction?
  • What is a taxonomy of the safety issues, social impacts, and ethical concerns associated with generative AI development? How should we update best practices to address these ethical concerns? Who should initiate and lead this dialogue? What steps can be taken to mitigate these ethical issues effectively? Imagine we have an oracle multimodal generative AI system that is used at a large scale: what types of data pollution would arise if most of its generated content were published on the Internet?
  • When assessing the quality of multimodal outputs from generative AI systems, which dimensions should be prioritized? Can we develop metrics that allow for large-scale evaluation while mitigating potential safety and ethical risks?
  • Diffusion models have shown remarkable performance in controllable text-to-image generation. Could you explain the intuition behind why diffusion models are effective, especially in comparison to other generative model families such as GANs, VAEs, and autoregressive LLMs? Some works claim that scaling up GANs can beat diffusion models (here), while others claim that language models are better than diffusion models for image generation (here). Which generative model family do you think is the most promising for multimodal generation? (A toy version of the diffusion training objective is sketched at the end of this week's list.)
  • For state-of-the-art video generation models like Sora, Yann LeCun argued here that Sora does not understand the real world and its corresponding physical rules. Do you agree with this view? Can the future development of generative AI systems truly incorporate real-world knowledge, or are they limited in this aspect? Is pursuing generative AI a viable path towards achieving Artificial General Intelligence (AGI)?
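  For the question above about why diffusion models work, the core training objective is compact enough to show directly: corrupt the data at a random noise level and train a network to predict the injected noise. The sketch below does this on a toy 2-D dataset with a tiny MLP and a linear noise schedule; all of these choices are simplifications for intuition, not a faithful reproduction of any particular paper's setup.

      # Minimal sketch of the denoising-diffusion training objective: add noise
      # to data at a random timestep and train a network to predict that noise.
      import torch
      import torch.nn as nn

      T = 100
      betas = torch.linspace(1e-4, 0.02, T)
      alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention schedule

      # The network predicts the injected noise from (noisy sample, timestep).
      net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)

      data = torch.randn(512, 2) * 0.3 + torch.tensor([2.0, -1.0])  # toy 2-D "dataset"

      for step in range(200):
          t = torch.randint(0, T, (data.shape[0],))
          eps = torch.randn_like(data)
          a_bar = alphas_bar[t].unsqueeze(1)
          noisy = a_bar.sqrt() * data + (1 - a_bar).sqrt() * eps   # forward (noising) process
          pred = net(torch.cat([noisy, t.unsqueeze(1) / T], dim=1))
          loss = ((pred - eps) ** 2).mean()                        # noise-prediction loss
          opt.zero_grad(); loss.backward(); opt.step()

      print(f"final denoising loss: {loss.item():.3f}")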
3/5 Week 8 No classes – Spring break
3/12 Week 9 Interaction Part1: Reasoning and Large Models [synopsis]
  • Currently, most multimodal reasoning models are limited to the vision and language domains. In the real world, however, multimodal reasoning exists more broadly and takes more diverse forms. Can you list a few more examples of multimodal reasoning tasks in daily life that rely on other modalities, and describe how symbolic or other specialized reasoning methods could be applied to them?
  • Can you create a taxonomy of all the potential symbolic systems, such as AMR graphs, knowledge graphs, and programs, that can be helpful for different types of multimodal reasoning tasks? What are their unique advantages and disadvantages?
  • Based on the findings here, are there other complex reasoning tasks, besides reverse logic problems, that you think current foundation models might not handle well? How can neural symbolic models be incorporated to help with those hard cases?
  • Besides the approach here, can you imagine other potential ways to uncover the reasoning capabilities of black-box models, such as large language models and other multimodal foundation models? How can one specifically uncover the cross-modal reasoning processes in such a black-box model?
  • To what extent do we need external knowledge when performing reasoning, specifically multimodal reasoning? What type of external knowledge is likely to be needed to succeed in multimodal reasoning?
  • What are the main advantages of reasoning-based approaches compared to the large-scale multimodal models discussed in previous lectures? What are the potential issues with reasoning? Can we perform reasoning on very large datasets? Will pre-training methods eventually learn reasoning processes similar to humans', or will we still need human and domain knowledge to some extent?
  • Are there unique technical challenges that arise when we apply neural symbolic methods to multimodal data? What are these unique challenges? How can we start studying them in future research?
3/19 Week 10 Interaction Part2: Embodiment and Planning [synopsis]
  • For what sorts of embodied tasks might code be a good representation? For what tasks would it be a poor one?
  • When generating embodied plans from natural language, several dimensions of difficulty are the ambiguity in the language, the difficulty of grounding language in the environment, and the difficulty of carrying out the plan in the environment. Consider one or two of the papers from this week – which of these dimensions (or others) do they mainly address?
  • When building an embodied AI, one key challenge is defining a simple and clear action space to ground language into. Given a particular task such as cooking or housekeeping, how should we design an appropriate action space that can be easily and accurately grounded? Provide a task and its corresponding action-space design for grounding (one illustrative example is sketched at the end of this week's list).
  • Robotics requires broader multisensory machine learning techniques beyond the well-studied vision and language modalities. What potential sensory modalities for embodied agent tasks are not well studied now? What specific embodied tasks require information from these sensory modalities to be completed?
  • Based on the release of Figure 01 (here), what are three main potential technical challenges for the next steps of embodied AI, and why? What are three potential applications, such as automated housekeeping, that are still not achievable for robotics today?
  • For embodied AI training, embodied data is widely considered a serious bottleneck, and there is a lot of work on synthesizing data from virtual or physical environments, such as here. What are the key challenges in synthesizing data for embodied tasks, and how can we make sure the synthesized data are high quality?
  • What challenges do social settings add beyond standard embodied/robotics tasks?
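  To make the action-space question above concrete, below is one possible (and deliberately tiny) grounded action space for a toy cooking task: a handful of parameterized, discrete actions over named objects that a language plan can be mapped onto. The task, the object list, and the verbs are illustrative choices, not drawn from any of the readings.

      # Minimal sketch: a small parameterized action space for a toy cooking task,
      # plus a grounded plan for "fry an egg" expressed in that space.
      from dataclasses import dataclass
      from enum import Enum
      from typing import Optional

      class Verb(Enum):
          PICK_UP = "pick_up"
          PLACE_ON = "place_on"
          TURN_ON = "turn_on"
          POUR_INTO = "pour_into"

      OBJECTS = {"pan", "stove", "egg", "bowl", "spatula"}

      @dataclass
      class Action:
          verb: Verb
          target: str                        # object acted upon
          destination: Optional[str] = None  # second argument for PLACE_ON / POUR_INTO

          def __post_init__(self):
              assert self.target in OBJECTS, f"unknown object: {self.target}"

      plan = [
          Action(Verb.PICK_UP, "pan"),
          Action(Verb.PLACE_ON, "pan", destination="stove"),
          Action(Verb.TURN_ON, "stove"),
          Action(Verb.POUR_INTO, "egg", destination="pan"),
      ]
      for a in plan:
          print(a)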
3/26 Week 11 Interaction Part3: Interaction with People [synopsis]
  • Humans can provide many different types of feedback to help models accomplish challenging tasks in NLP, robotics, and multimodal tasks (e.g., ranking, scoring, and instructing). What are other types of feedback that can be useful for model training? Can we create a taxonomy of feedback forms, and describe each of their pros and cons? When should we use each type of feedback? (A toy example of learning from ranking feedback is sketched at the end of this week's list.)
  • In NLP, there’s been a trend of replacing human annotations/feedback with large language models. What are some limitations of this approach? What tasks that are currently done by humans cannot be replaced by large foundation models? What abilities might models need to have to be able to fully replace human annotators?
  • One key aspect of computational pragmatics is how context makes language have meaning beyond what’s literally said. Give some examples of settings that involve multimodal context where the multimodality changes or enriches the literal meaning of the language.
  • Brainstorm some settings where it would be useful for models to adapt to the people they are interacting with. This adaptation could involve the people's language, preferences, and backgrounds. Are these settings within reach of current models? What techniques do you think will be useful to enable adaptation? Are there also societal concerns if these models understand too much about their users?
  • Pick a task that people carry out in pairs or teams, that involves some social or grounded interaction between the people (e.g., pair programming, advising a graduate student, assembling a piece of furniture). How close or far do you think our current AI approaches are from being able to collaborate with the people carrying out this task? What is a research agenda towards enabling human-AI collaboration?
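  For the feedback-taxonomy question at the top of this list, ranking feedback in particular has a standard recipe: fit a reward model with a Bradley-Terry pairwise loss over preferred vs. rejected responses. The sketch below shows only that loss on synthetic feature vectors; in practice the inputs would be embeddings of model outputs judged by human annotators.

      # Minimal sketch: learn a reward model from pairwise (ranking) feedback
      # with a Bradley-Terry loss. Features are synthetic stand-ins for responses.
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      reward_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
      opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

      # Synthetic preferences: the "chosen" response has a higher hidden score.
      w_true = torch.randn(32)
      feats = torch.randn(2000, 2, 32)                 # pairs of candidate responses
      chosen_first = (feats @ w_true)[:, 0] > (feats @ w_true)[:, 1]

      for step in range(300):
          r = reward_model(feats).squeeze(-1)          # predicted reward for each response
          margin = torch.where(chosen_first, r[:, 0] - r[:, 1], r[:, 1] - r[:, 0])
          loss = -nn.functional.logsigmoid(margin).mean()   # Bradley-Terry pairwise loss
          opt.zero_grad(); loss.backward(); opt.step()

      print(f"final preference loss: {loss.item():.3f}")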
4/2 Week 12 Ethics and Safety [synopsis]
  • What are some ways to assess the trustworthiness of LLMs? How does the problem become harder when these LLMs are multimodal in their inputs and outputs? How can our earlier discussions on multimodal interactions, reasoning, etc. give new insights into improving the trust and safety of multimodal LLMs?
  • When are multimodal models more robust to adversarial attacks? When are they more susceptible? Why do both cases occur, and how can this inform our design of robust multimodal systems? (The mechanics of one classic attack are sketched at the end of this week's list.)
  • What are the qualities we should consider when evaluating outputs from multimodal generative AI? What do you think is the best practice to evaluate these qualities? Can we efficiently evaluate these qualities, at scale?
  • What are the real-world ethical issues regarding multimodal models? How can we build a taxonomy of the main ethical concerns, so that we can systematically evaluate and combat them? What are some ethical concerns that you are worried about, but not already popularized in mainstream media?
  • How can we update our best practices to help address these ethical concerns? Who is best placed to start this dialogue: academic researchers, industry, policymakers, or others? How can we make significant changes in this direction of highlighting and mitigating ethical issues?
  • Facing a foundation model system, what types of attacks could you launch to make the system fail or perform worse? What is a taxonomy of the attacks a user can make? What types of safety issues are exposed by different types of attacks?
  • When discussing robustness, what should an ideal robust multimodal model be able to do? Comparing multimodal and unimodal models, which kind do you think is more robust? Briefly describe why you think one type is more robust than the other when facing a particular problem.
  • Jailbreaking of foundation models is a commonly discussed topic. What is the root cause of models being jailbreakable? What are the potential ways to avoid such attacks and build guardrails?
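  As a concrete anchor for the robustness questions above, the sketch below shows the mechanics of one classic white-box attack, the fast gradient sign method (FGSM): perturb the input in the direction of the sign of the loss gradient. The untrained toy classifier, random "images", and the 8/255 budget are assumptions for illustration only.

      # Minimal sketch of FGSM: perturb inputs along the sign of the loss gradient.
      import torch
      import torch.nn as nn

      model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy, untrained classifier
      loss_fn = nn.CrossEntropyLoss()

      x = torch.rand(8, 3, 32, 32, requires_grad=True)  # stand-in image batch
      y = torch.randint(0, 10, (8,))                    # stand-in labels

      loss = loss_fn(model(x), y)
      loss.backward()

      epsilon = 8 / 255                                 # perturbation budget
      x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

      print("clean loss:   ", loss_fn(model(x), y).item())
      print("attacked loss:", loss_fn(model(x_adv), y).item())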
4/9 Week 13 Efficiency and Privacy [synopsis]
  • Different papers have different definitions of efficiency, including memory, time, and space. Are there other notions of efficiency that you think current work is missing out on, especially as we build multimodal systems for the real world? How can we make progress on these new notions?
  • How can our study of multimodal connections/interactions help us design more efficient models? How should we balance careful and efficient model design from the start, versus training large models and compressing them as a post-hoc step?
  • How can we scale multimodal models to extremely long sequence lengths, such as over years of human experience? What new capabilities will this enable? How can we start creating benchmarks to make progress toward these capabilities?
  • There has been a lot of work on making language models and vision models more efficient. What ideas from this line of work can be translated to other modalities and other multimodal problems? What new domain expertise will we need to build efficient models for these other settings? (One basic efficiency technique is sketched at the end of this week's list.)
  • Most work on improving efficiency fixes the modalities and makes the models more efficient. Are there potential ideas for changing the modalities themselves so that they can be handled more efficiently (e.g., going from video to images or to wireless sensors)?
  • How can we formalize the balance between information, fidelity, efficiency, and privacy of different modalities, and how can we choose which ones to use for a given problem?
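  As one concrete instance of the efficiency techniques mentioned above, the sketch below applies symmetric int8 post-training quantization to a weight matrix and reports the memory saving and reconstruction error. Real systems add per-channel scales, calibration data, and fused kernels; this shows only the core arithmetic on a randomly generated matrix.

      # Minimal sketch: symmetric int8 post-training quantization of one weight matrix.
      import numpy as np

      rng = np.random.default_rng(0)
      weights = rng.normal(0, 0.05, size=(4096, 4096)).astype(np.float32)

      scale = np.abs(weights).max() / 127.0         # a single scale for the whole tensor
      q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
      dequant = q.astype(np.float32) * scale

      print(f"memory: {weights.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
      print(f"mean absolute quantization error: {np.abs(weights - dequant).mean():.6f}")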
4/16 Week 14 Open Discussion [synopsis]
4/23 Week 15 Project presentations