This paper addresses hand-object 3D reconstruction, an increasingly important task in applications such as human-robot interaction and immersive AR/VR experiences. Conventional object-agnostic approaches to hand-object reconstruction from RGB sequences follow a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques such as structure-from-motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weakly textured surfaces, and mutual hand-object occlusion, limiting their scalability and generalizability. As a key step toward general, smooth, and non-intrusive applicability, this study proposes a robust, keypoint-detector-free approach to estimating hand-object 3D transformations from monocular motion video. By integrating this approach with a multi-view reconstruction pipeline, we further recover accurate hand-object 3D shape. Our method, named HOSt3R, is unconstrained: it relies neither on pre-scanned object templates nor on internal camera parameters. It achieves state-of-the-art performance on the SHOWMe benchmark for object-agnostic hand-object 3D transformation and shape estimation. We also demonstrate generalization to unseen object categories through experiments on sequences from the HO3D dataset.