Daily Arxiv
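This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.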
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
A Roadmap for Climate-Relevant Robotics Research
Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening
MMOne: Representing Multiple Modalities in One Scene
SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
(Almost) Free Modality Stitching of Foundation Models
A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection
THOR: Transformer Heuristics for On-Demand Retrieval
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos
Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model
Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling
VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
ReCode: Updating Code API Knowledge with Reinforcement Learning
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Multiple-Frequencies Population-Based Training
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
GPU Performance Portability needs Autotuning
Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID
Coral Protocol: Open Infrastructure Connecting The Internet of Agents
MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence
ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs
Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
KP Quantum Neural Networks
VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models
Data-Efficient Deep Operator Network for Unsteady Flow: A Multi-Fidelity Approach with Physics-Guided Subsampling
Learning Universal Human Mobility Patterns with a Foundation Model for Cross-domain Data Fusion
GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for Dynamic Legged Robotics
A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Multi-View Node Pruning for Accurate Graph Representation
V-Max: A Reinforcement Learning Framework for Autonomous Driving
Interpretable Transformation and Analysis of Timelines through Learning via Surprisability
AI Governance InternationaL Evaluation Index (AGILE Index) 2024
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Improving Transformer World Models for Data-Efficient RL
LLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation
SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks
Determination of galaxy photometric redshifts using Conditional Generative Adversarial Networks (CGANs)
Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis
MRGen: Segmentation Data Engine for Underrepresented MRI Modalities
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
Out-of-Distribution Recovery with Object-Centric Keypoint Inverse Policy for Visuomotor Imitation Learning
Dataset resulting from the user study on comprehensibility of explainable AI algorithms
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization
Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information
DeFine: Decision-Making with Analogical Reasoning over Factor Profiles
Benchmarking Sub-Genre Classification For Mainstage Dance Music
Risks of ignoring uncertainty propagation in AI-augmented security pipelines
MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced AI Applications with Retrieval Augmented Generation and Knowledge Graphs
Leveraging Quantum Superposition to Infer the Dynamic Behavior of a Spatial-Temporal Neural Network Signaling Model
Bounding the Worst-class Error: A Boosting Approach
TBDetector:Transformer-Based Detector for Advanced Persistent Threats with Provenance Graph
Machine Learning Systems: A Survey from a Data-Oriented Perspective
Aime: Towards Fully-Autonomous Multi-Agent Framework
SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control
Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments
NTRL: Encounter Generation via Reinforcement Learning for Dynamic Difficulty Adjustment in Dungeons and Dragons
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
BEARCUBS: A benchmark for computer-using web agents
Demystifying MuZero Planning: Interpreting the Learned Model
LLM-Enhanced User-Item Interactions: Leveraging Edge Information for Optimized Recommendations
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Imbalance in Balance: Online Concept Balancing in Generation Models
Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Voxtral
Merge Kernel for Bayesian Optimization on Permutation Space
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy
Automating Steering for Safe Multimodal Large Language Models
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
VITA: Vision-to-Action Flow Matching Policy
$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection
Aligning Humans and Robots via Reinforcement Learning from Implicit Human Feedback
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks
Prompt Injection 2.0: Hybrid AI Threats
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
BEARCUBS: A benchmark for computer-using web agents
Created by
Haebom
Authors
Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer
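Overview
This paper introduces BEARCUBS, a new benchmark for evaluating the information-seeking ability of web agents in real web environments. BEARCUBS consists of 111 information-seeking questions and, unlike existing benchmarks, uses real web pages and requires a range of multimodal interactions (such as video understanding and 3D navigation). Human experiments indicate that the questions are of appropriate difficulty (84.7% human accuracy), while state-of-the-art web agents reach low accuracy (23.4% at best). This underscores the importance of reliable source selection and strong multimodal capabilities. BEARCUBS will be updated continually to support web agent research.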
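Takeaways, Limitations
• Takeaways:
  ◦ Presents BEARCUBS, a new benchmark for evaluating web agent performance in real web environments.
  ◦ Overcomes limitations of existing benchmarks and highlights the need for a range of multimodal interactions.
  ◦ Identifies causes of the low performance of state-of-the-art web agents and points to directions for improvement (reliable source selection, strong multimodal capabilities).
  ◦ Provides a continually updated benchmark for web agent research.
• Limitations:
  ◦ The current number of benchmark questions (111) may be relatively small.
  ◦ BEARCUBS requires ongoing updates and maintenance.
  ◦ The benchmark's suitability must be re-examined continually as the web environment changes.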