Daily Arxiv
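This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.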
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
A Roadmap for Climate-Relevant Robotics Research
Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening
MMOne: Representing Multiple Modalities in One Scene
SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
(Almost) Free Modality Stitching of Foundation Models
A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection
THOR: Transformer Heuristics for On-Demand Retrieval
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos
Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model
Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling
VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
ReCode: Updating Code API Knowledge with Reinforcement Learning
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Multiple-Frequencies Population-Based Training
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
GPU Performance Portability needs Autotuning
Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID
Coral Protocol: Open Infrastructure Connecting The Internet of Agents
MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence
ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs
Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
KP Quantum Neural Networks
VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models
Data-Efficient Deep Operator Network for Unsteady Flow: A Multi-Fidelity Approach with Physics-Guided Subsampling
Learning Universal Human Mobility Patterns with a Foundation Model for Cross-domain Data Fusion
GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for Dynamic Legged Robotics
A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Multi-View Node Pruning for Accurate Graph Representation
V-Max: A Reinforcement Learning Framework for Autonomous Driving
Interpretable Transformation and Analysis of Timelines through Learning via Surprisability
AI Governance InternationaL Evaluation Index (AGILE Index) 2024
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Improving Transformer World Models for Data-Efficient RL
LLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation
SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks
Determination of galaxy photometric redshifts using Conditional Generative Adversarial Networks (CGANs)
Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis
MRGen: Segmentation Data Engine for Underrepresented MRI Modalities
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
Out-of-Distribution Recovery with Object-Centric Keypoint Inverse Policy for Visuomotor Imitation Learning
Dataset resulting from the user study on comprehensibility of explainable AI algorithms
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization
Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information
DeFine: Decision-Making with Analogical Reasoning over Factor Profiles
Benchmarking Sub-Genre Classification For Mainstage Dance Music
Risks of ignoring uncertainty propagation in AI-augmented security pipelines
MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced AI Applications with Retrieval Augmented Generation and Knowledge Graphs
Leveraging Quantum Superposition to Infer the Dynamic Behavior of a Spatial-Temporal Neural Network Signaling Model
Bounding the Worst-class Error: A Boosting Approach
TBDetector:Transformer-Based Detector for Advanced Persistent Threats with Provenance Graph
Machine Learning Systems: A Survey from a Data-Oriented Perspective
Aime: Towards Fully-Autonomous Multi-Agent Framework
SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control
Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments
NTRL: Encounter Generation via Reinforcement Learning for Dynamic Difficulty Adjustment in Dungeons and Dragons
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
BEARCUBS: A benchmark for computer-using web agents
Demystifying MuZero Planning: Interpreting the Learned Model
LLM-Enhanced User-Item Interactions: Leveraging Edge Information for Optimized Recommendations
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Imbalance in Balance: Online Concept Balancing in Generation Models
Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Voxtral
Merge Kernel for Bayesian Optimization on Permutation Space
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy
Automating Steering for Safe Multimodal Large Language Models
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
VITA: Vision-to-Action Flow Matching Policy
$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection
Aligning Humans and Robots via Reinforcement Learning from Implicit Human Feedback
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks
Prompt Injection 2.0: Hybrid AI Threats
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
Created by
Haebom
Authors
Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
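Summary
This paper points out the limitations of existing benchmarks used in software engineering, in particular the SWE-bench dataset, and presents SWE-MERA, a new benchmark designed to address them. It notes that SWE-bench suffers from serious data contamination issues (direct solution leakage and inadequate test cases) that reduce its reliability. SWE-MERA aims to address these by automatically collecting real GitHub issues and applying rigorous quality validation. It currently offers about 10,000 potential tasks, with 300 samples available. The authors evaluate more than a dozen recent LLMs on tasks collected between September 2024 and June 2025.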
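The idea of scoring a model only on tasks from a given collection window can be pictured with the minimal sketch below. It is not the SWE-MERA pipeline: the `Task` record, its fields, the helper names, and the sample data are all hypothetical, and the exact day boundaries of the September 2024 to June 2025 window are assumed for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one benchmark task built from a GitHub issue."""
    task_id: str
    created: date    # date the underlying issue was opened (assumed field)
    resolved: bool   # whether the evaluated model's patch passed the tests


def in_window(task: Task, start: date, end: date) -> bool:
    """Return True if the task was created within [start, end], inclusive."""
    return start <= task.created <= end


def resolve_rate(tasks: list[Task], start: date, end: date) -> float:
    """Fraction of in-window tasks the model resolved (0.0 if none qualify)."""
    selected = [t for t in tasks if in_window(t, start, end)]
    if not selected:
        return 0.0
    return sum(t.resolved for t in selected) / len(selected)


# Made-up example data, not results from the paper.
tasks = [
    Task("a", date(2024, 8, 15), True),   # before the window, excluded
    Task("b", date(2024, 10, 1), True),
    Task("c", date(2025, 3, 20), False),
    Task("d", date(2025, 6, 30), True),
]

# Assumed window boundaries: 1 September 2024 through 30 June 2025.
rate = resolve_rate(tasks, date(2024, 9, 1), date(2025, 6, 30))
print(f"resolve rate in window: {rate:.2f}")
```

Restricting the score to a date range like this is one simple way a continuously updated benchmark could limit the effect of tasks a model may have seen during training; the paper's actual selection and validation criteria may differ.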
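Takeaways, Limitations
• Takeaways:
◦ Exposes the data contamination problems in the existing SWE-bench dataset and makes the case for a new benchmark.
◦ Proposes SWE-MERA, a practical benchmark built on real GitHub issues, with an automated data collection and quality validation pipeline.
◦ Compares a range of recent LLMs and shows that the benchmark discriminates between models.
◦ Contributes to the progress of LLMs in software engineering through a continuously updated, dynamic benchmark.
• Limitations:
◦ Only 300 samples out of roughly 10,000 potential tasks are currently released, which limits the scale of the benchmark.
◦ The description of SWE-MERA's quality validation process may lack detail.
◦ Evaluation results may depend on the specific coding agent used.
◦ Because the dataset is built from GitHub issues, it may be biased toward particular kinds of software engineering problems.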