This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Hulk is the first multimodal human-centric generalist model capable of handling diverse human-centric perception tasks, including 2D vision, 3D vision, skeleton-based, and vision-language tasks. Existing human-centric models have limitations, such as the inability to handle 3D and vision-language tasks and the need for task-specific fine-tuning. To address these challenges, Hulk consolidates diverse task-specific heads into two general heads: one for discrete representations (e.g., language) and one for continuous representations (e.g., coordinates). This unified representation allows Hulk to handle diverse human-centric tasks through modality translation and to integrate knowledge across a wide range of tasks. A comprehensive evaluation on 12 benchmarks covering eight human-centric tasks demonstrates the superiority of the proposed method, which achieves state-of-the-art performance on 11 of them. The code is available at https://github.com/OpenGVLab/Hulk.
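To make the two-head design more concrete, here is a minimal sketch of what routing all tasks through one discrete head and one continuous head could look like. This is an illustrative assumption, not the authors' implementation: the class name `TwoHeadDecoder`, the `modality` flag, and all dimensions are hypothetical choices for the example.

```python
import torch
import torch.nn as nn

class TwoHeadDecoder(nn.Module):
    """Hypothetical sketch of Hulk-style unified output heads.

    All task-specific outputs are routed through two shared heads: a discrete
    head that predicts tokens (e.g., words or class labels) and a continuous
    head that regresses numeric values (e.g., 2D/3D coordinates). Names and
    dimensions are assumptions for illustration only.
    """

    def __init__(self, hidden_dim: int = 768, vocab_size: int = 32000, coord_dim: int = 3):
        super().__init__()
        self.discrete_head = nn.Linear(hidden_dim, vocab_size)   # logits over a shared token vocabulary
        self.continuous_head = nn.Linear(hidden_dim, coord_dim)  # regressed coordinate values

    def forward(self, features: torch.Tensor, modality: str) -> torch.Tensor:
        # `modality` selects which shared head interprets the decoder features.
        if modality == "discrete":
            return self.discrete_head(features)    # (batch, seq, vocab_size)
        return self.continuous_head(features)      # (batch, seq, coord_dim)


# Example: the same decoder features can serve a captioning task (discrete
# tokens) and a 3D pose estimation task (continuous coordinates).
decoder = TwoHeadDecoder()
feats = torch.randn(2, 16, 768)
caption_logits = decoder(feats, modality="discrete")
pose_coords = decoder(feats, modality="continuous")
```

The point of the sketch is that task diversity lives in the input/output formulation (what the tokens or coordinates mean), while the output machinery itself stays shared, which is what lets knowledge transfer across tasks without task-specific heads.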
◦ Presents the first multimodal model capable of handling diverse human-centric perception tasks (2D/3D vision, skeleton-based, and vision-language) without task-specific fine-tuning.
◦ A unified representation through two shared heads enables knowledge integration and modality translation across tasks.
◦ Achieves state-of-the-art performance on 11 of 12 benchmarks.
◦ Open-source release broadens research use and accessibility.
• Limitations:
◦ Generalization needs to be verified on tasks beyond the benchmarks presented.
◦ Further analysis of the model's size and computational cost is needed.
◦ Further research is needed to optimize performance on specific tasks.