Mohit Bansal
John R. & Louise S. Parker Professor of Computer Science
University of North Carolina at Chapel Hill
Multimodal Generative LLMs: Unification, Interpretability, Evaluation
In this talk, I will present our journey in developing large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts, etc.) and in enhancing important aspects of these models: unification (for generalizability, shared knowledge, and efficiency), interpretable programming/planning (for controllability and faithfulness), and evaluation (of fine-grained skills, faithfulness, and social biases). First, we will discuss early cross-modal vision-and-language pretraining models (LXMERT). We will then look at early unified models (VL-T5) that combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating all tasks as text generation. We will next look at recent, progressively more unified models (with joint objectives and architectures, as well as newer unified modalities during encoding and decoding) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), interactive, interleaved, composable any-to-any text-audio-image-video multimodal generation (CoDi, CoDi-2), and compositional video reasoning with optical flow, 3D point clouds, and audio (CREMA). Second, we will discuss interpretable and controllable multimodal generation (to improve faithfulness) via LLM-based planning and programming, such as layout-controllable image generation via visual programming (VPGen), consistent multi-scene video generation via LLM-guided planning (VideoDirectorGPT), and open-domain, open-platform diagram generation (DiagrammerGPT). I will conclude with important faithfulness and bias evaluation aspects of multimodal generation models, based on fine-grained skill and social bias evaluation (DALL-Eval), interpretable and explainable visual programs (VPEval), and reliable fine-grained evaluation via Davidsonian-semantics-based scene graphs (DSG).
Biosketch
Dr. Mohit Bansal is the John R. & Louise S. Parker Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science Department at the University of North Carolina (UNC) at Chapel Hill. Prior to this, he was a research assistant professor (a 3-year endowed position) at TTI-Chicago. He received his Ph.D. in 2013 from the University of California at Berkeley (where he was advised by Dan Klein) and his B.Tech. from the Indian Institute of Technology at Kanpur in 2008. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on grounded and embodied semantics, language generation and Q&A/dialogue, and interpretable and generalizable deep learning.
Dr. Bansal is a recipient of the IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. He has been a keynote speaker at the AACL 2023 and INLG 2022 conferences. His service includes membership on the ACL Executive Committee and the ACM Doctoral Dissertation Award Committee, and roles as CoNLL Program Co-Chair, ACL Americas Sponsorship Co-Chair, and Associate/Action Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals.