Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Created by
  • Haebom

Author

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer

Outline

This paper presents PerceptionLM (PLM), a Perception Language Model built within a fully open and reproducible framework for vision-language research in computer vision. Without distilling from proprietary models, the authors analyze standard training pipelines and leverage large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To close these gaps, they release 2.8 million human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. They also introduce PLM-VideoBench, a suite of evaluations for challenging video understanding tasks that focus on reasoning about the "what," "where," "when," and "how" of a video. Data, training recipes, code, and models are all released to make the work fully reproducible.

Takeaways, Limitations

Takeaways:
  • Presents an open, reproducible vision-language model research framework that does not rely on proprietary models
  • Releases a large-scale human-labeled dataset (2.8 million video question-answer pairs and captions)
  • Introduces PLM-VideoBench, a new benchmark suite for video understanding
  • Analyzes data gaps and proposes solutions through the use of synthetic data
Limitations:
  • Synthetic data may not fully capture the complexity of real-world data
  • The evaluation scope of PLM-VideoBench may be limited
  • Even with an open model, its complexity may make reproduction difficult for some researchers