The 15th ACM International Conference on Multimedia Retrieval
Bio: Nicu Sebe is a professor at the University of Trento, Italy, where he leads research on multimedia information retrieval and human-computer interaction in computer vision applications. He received his PhD from the University of Leiden, The Netherlands, and was previously with the University of Amsterdam, The Netherlands, and the University of Illinois at Urbana-Champaign, USA. He has been involved in organizing major conferences and workshops addressing the computer vision and human-centered aspects of multimedia information retrieval, serving among other roles as General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008), the ACM International Conference on Multimedia Retrieval (ICMR) 2017, and ACM Multimedia 2013 and 2022. He was a Program Chair of ACM Multimedia 2007 and 2011, ECCV 2016, ICCV 2017, and ICPR 2020. He is a fellow of ELLIS and IAPR and a Senior Member of the ACM and IEEE. He is Co-Editor-in-Chief of the Computer Vision and Image Understanding journal.
Title: Unveiling Bias and Safety Issues in Generative Models
Abstract: Recent advances in generative models—Text-to-Image (T2I) models and vision-language models (VLMs)—have improved image quality and accessibility, but fairness and safety issues remain. Current bias detection often relies on fixed categories or focuses only on unsafe inputs. We address these limitations with two contributions: (1) an open-set framework for T2I models that uses LLMs and Visual Question Answering to detect and explain a wide range of biases, and (2) a training-free method for VLMs (Unsafe Weights Manipulation) paired with a new evaluation metric (SafeGround) to improve safety without harming performance on safe inputs. Together, these methods advance fairness and robustness in generative AI.
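For readers who want a concrete picture, the following is a minimal sketch of the kind of open-set bias probing the abstract describes: candidate bias dimensions are turned into VQA questions asked of images sampled from a text-to-image model, and skewed answer distributions flag potential biases. This is not the framework presented in the talk; the model names (Stable Diffusion v1.5, ViLT VQA), the hard-coded probe list (which the actual framework would have an LLM propose per prompt), and the tiny sample size are illustrative assumptions, and the code assumes a GPU plus the diffusers and transformers packages.

```python
from collections import Counter

import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline

# Illustrative model choices, not the ones used in the work described in the talk.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

prompt = "a photo of a doctor"

# In the open-set setting these attribute/question pairs would be proposed by an
# LLM from the prompt itself rather than fixed in advance.
probes = {
    "perceived gender": "What is the gender of the person?",
    "perceived age": "Is the person young or old?",
}

counts = {name: Counter() for name in probes}
for _ in range(8):  # small sample size purely for illustration
    image = t2i(prompt).images[0]
    for name, question in probes.items():
        answer = vqa(image=image, question=question, top_k=1)[0]["answer"]
        counts[name][answer] += 1

# A heavily skewed answer distribution for a neutral prompt suggests a bias
# along that dimension.
for name, dist in counts.items():
    print(name, dict(dist))
```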
Bio: K. Selçuk Candan is a Professor of Computer Science and Engineering at Arizona State University and the Director of ASU's Center for Assured and Scalable Data Engineering (CASCADE). His research is at the nexus of data management and media understanding. He has published over 200 journal and peer-reviewed conference articles, one book, and 16 book chapters, and he holds 9 patents. Prof. Candan served as an associate editor for the Very Large Data Bases (VLDB) Journal, IEEE Transactions on Multimedia, Journal of Multimedia, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Cloud Computing. He is currently an Associate Editor for ACM Transactions on Database Systems and the Proceedings of the VLDB Endowment (PVLDB). He also served as a founding managing editor for the ACM PACMMOD journal. He served as a member of the Executive Committees of ACM SIGMOD and SIGMM and is a member of the Steering Committee for ACM ICMR. He is an ACM Distinguished Scientist and a recipient of the 2023 SIGMOD Contributions Award. You can find more information about his research at http://kscandan.site.
Title: The Power of “Why?” in Decision Making in Complex, Dynamic Systems
Abstract: Understanding the underlying dynamics of emerging phenomena is increasingly critical in various application domains, from social media trends to predicting the geo-temporal evolution of epidemics to helping reduce the energy footprints of buildings. Addressing the most pressing societal challenges requires (a) a deep understanding of the relationships and interactions among diverse, spatially and temporally distributed entities and (b) the capability to develop, and explain, informed forecasts based on such an understanding. In this talk, I argue that achieving these goals necessitates the ability to use spatio-temporal information to gain causal situational awareness and to leverage such causal information to tackle both aleatoric and epistemic uncertainties in decision making. Despite the apparent promise of such a causally grounded approach, the core technologies required to achieve it are in their early stages, and frameworks to realize their potential are still lacking. In this keynote, I will argue for a vision of causal awareness in algorithms and applications and highlight our community's role in this context.
Bio: Sergey Tulyakov is the Director of Research at Snap Inc., where he leads the Creative Vision team. Sergey's work focuses on building technology to enhance creators' skills using computer vision, machine learning, and generative AI. His work spans 2D, 3D, and video generation, editing, and personalization. To scale generative experiences to hundreds of millions of users, Sergey's team builds the world's most efficient mobile foundational models, which enhance multiple products at Snap Inc. Sergey pioneered the video generation and unsupervised animation domains with MoCoGAN, MonkeyNet, and the First Order Motion Model, sparking several startups in the field. His work on Interactive Video Stylization received the Best in Show Award at SIGGRAPH Real-Time Live! 2020. He has authored over 60 top-tier conference papers, journal articles, and patents, resulting in multiple innovative products, including Real-time Neural Lenses, real-time try-on, Snap AI Video, Imagine Together, the world's fastest foundational image-to-image model, and many more. Before joining Snap Inc., Sergey was with Carnegie Mellon University, Microsoft, and NVIDIA. He holds a PhD from the University of Trento, Italy.
Title: Three and a Half Generations of Video Generation Models
Abstract: In the last decade, video generation research has progressed through several transformative phases. The earliest approaches—Generation 0—extended early image generation models temporally. While these models achieved impressive results in domain-specific applications, they fell short of solving the general text-to-video problem.
The breakthrough success of large diffusion models in image synthesis and editing brought Generation 1 of video models. These approaches incorporated temporal layers into diffusion-based image architectures, significantly improving output quality. However, because these models lacked an explicit understanding of the time axis, they often produced visual artifacts.
Today, we are firmly in Generation 2, where videos are treated as first-class citizens. These models leverage spatio-temporal autoencoders to convert videos into compact latent spaces, suitable for denoising-based generation—typically powered by large transformer architectures. Generation 2 has demonstrated remarkable improvements in quality, prompt adherence, and controllability.
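To make the Generation 2 recipe concrete, the following is a minimal, self-contained sketch of the pipeline this paragraph describes: a spatio-temporal autoencoder compresses a video clip into a compact latent, and a transformer operating on the flattened latent tokens predicts the noise to remove. It is a toy illustration, not any specific published or production model; all layer sizes, channel counts, and the single denoising step are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class SpatioTemporalAE(nn.Module):
    """3D-convolutional autoencoder: (B, C, T, H, W) -> compact latent -> video."""

    def __init__(self, channels=3, latent_dim=8):
        super().__init__()
        # Downsample time and space jointly with strided 3D convolutions.
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, latent_dim, kernel_size=3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, channels, kernel_size=4, stride=2, padding=1),
        )

    def encode(self, video):
        return self.encoder(video)

    def decode(self, latent):
        return self.decoder(latent)


class LatentTransformerDenoiser(nn.Module):
    """Treats latent voxels as a token sequence and predicts the noise to remove."""

    def __init__(self, latent_dim=8, d_model=128, num_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_latent):
        b, c, t, h, w = noisy_latent.shape
        tokens = noisy_latent.flatten(2).transpose(1, 2)        # (B, T*H*W, C)
        tokens = self.proj_out(self.blocks(self.proj_in(tokens)))
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)    # predicted noise


if __name__ == "__main__":
    ae, denoiser = SpatioTemporalAE(), LatentTransformerDenoiser()
    video = torch.randn(1, 3, 8, 32, 32)           # toy clip: 8 frames of 32x32 RGB
    latent = ae.encode(video)                      # compact spatio-temporal latent
    noise = torch.randn_like(latent)
    pred_noise = denoiser(latent + noise)          # one illustrative denoising step
    recon = ae.decode(latent + noise - pred_noise) # back to pixel space
    print(video.shape, latent.shape, recon.shape)
```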
Generation 3 is currently emerging. While no unified framework has yet been established, the limitations of Generation 2 are well understood. First, current video representations are simplistic and fail to exploit temporal redundancy. Second, videos are still generated in bulk, often requiring hours to complete. The goal is real-time or faster-than-real-time generation. Ongoing research aimed at solving these issues constitutes what we refer to as Generation 3.5.
In this talk, I will explore the conceptual and technical evolution of video generation models, highlight the distinguishing features of each generation, and discuss promising directions for future research.