Daily Arxiv
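This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; please cite the source when sharing.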
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Ada-TransGNN: An Air Quality Prediction Model Based On Adaptive Graph Convolutional Networks
Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery
Consistent Opponent Modeling of Static Opponents in Imperfect-Information Games
Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes
Agentic AI for Software: thoughts from Software Engineering community
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling
A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems
Generative Artificial Intelligence and Agents in Research and Teaching
CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression
Comparative Analysis of UAV Path Planning Algorithms for Efficient Navigation in Urban 3D Environments
Retrieval Enhanced Feedback via In-context Neural Error-book
From Confidence to Collapse in LLM Factual Robustness
On Task Vectors and Gradients
Learning in Repeated Multi-Objective Stackelberg Games with Payoff Manipulation
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
DLLMQuant: Quantizing Diffusion-based Large Language Models
LLM-Enhanced Linear Autoencoders for Recommendation
Leveraging GNN to Enhance MEF Method in Predicting ENSO
Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation
New Kid in the Classroom: Exploring Student Perceptions of AI Coding Assistants
Large Language Model-Based Framework for Explainable Cyberattack Detection in Automatic Generation Control Systems
SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
Apple Intelligence Foundation Language Models: Tech Report 2025
SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models
Demographic-aware fine-grained classification of pediatric wrist fractures
Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Solar Altitude Guided Scene Illumination
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models
EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion
RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing
Concept-Guided Interpretability via Neural Chunking
Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study
An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
Video CLIP Model for Multi-View Echocardiography Interpretation
A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Noise-based reward-modulated learning
Faster Parameter-Efficient Tuning with Token Redundancy Reduction
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
TableTalk: Scaffolding Spreadsheet Development with a Language Agent
StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel
Provably-Safe Neural Network Training Using Hybrid Zonotope Reachability Analysis
Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
Safe Multiagent Coordination via Entropic Exploration
TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use
Cultural Dimensions of AI Perception: Charting Expectations, Risks, Benefits, Tradeoffs, and Value in Germany and China
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI
Hierarchical Object-Oriented POMDP Planning for Object Rearrangement
From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification
Secure Reinforcement Learning via Shuffle Privacy Model
Overcoming label shift with target-aware federated learning
Benchmarking XAI Explanations with Human-Aligned Evaluations
HonestCyberEval: An AI Cyber Risk Benchmark for Automated Software Exploitation
Leveraging Multi-facet Paths for Heterogeneous Graph Representation Learning
GeNet: A Multimodal LLM-Based Co-Pilot for Network Topology and Configuration
ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL
Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis
Learning county from pixels: corn yield prediction with attention-weighted multiple instance learning
Memory augment is All You Need for image restoration
Rethinking Distribution Shifts: Empirical Analysis and Inductive Modeling for Tabular Data
DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models
Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles
Bayesian Deep Learning for Segmentation for Autonomous Safe Planetary Landing
ST-Raptor: LLM-Powered Semi-Structured Table Question Answering
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots
Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld
Multi-Agent LLMs as Ethics Advocates for AI-Based Systems
Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
MRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners
YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models
Consensus in Motion: A Case of Dynamic Rationality of Sequential Learning in Probability Aggregation
Can Large Language Models Act as Ensembler for Multi-GNNs?
Pessimistic Iterative Planning with RNNs for Robust POMDPs
Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding
Integrating Large Language Model for Improved Causal Discovery
A Survey on Causal Discovery: Theory and Practice
Generative Interfaces for Language Models
Interpolating Speaker Identities in Embedding Space for Data Expansion
VibeVoice Technical Report
LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding
Understanding Tool-Integrated Reasoning
Emotions as Ambiguity-aware Ordinal Representations
Real-Time Model Checking for Closed-Loop Robot Reactive Planning
SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
Created by
Haebom
Authors
Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
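Summary
This paper proposes SKA-Bench, a new benchmark for evaluating how well large language models (LLMs) understand structured knowledge (SK). SKA-Bench covers four types of SK: knowledge graphs (KG), tables, KG+text, and table+text. To evaluate SK understanding at a fine granularity, the benchmark instances are expanded into four fundamental capability testbeds: noise robustness, order insensitivity, information integration, and negative rejection. Experiments on eight representative LLMs show that existing models still have considerable difficulty understanding structured knowledge, and that their performance is affected by factors such as the amount of noise, the order of knowledge units, and hallucination. The dataset and code are available on GitHub.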
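As a rough illustration of two of the testbeds named above (noise robustness and order insensitivity), the sketch below builds perturbed prompts from a set of knowledge units and checks whether a model's answer stays stable. It is not the SKA-Bench code: the function names, the prompt template, the triple format, and `toy_model` are all assumptions made up for this example, and the information-integration testbed is not covered.

```python
import random
from typing import Callable, List, Sequence


def build_prompt(question: str, units: Sequence[str]) -> str:
    """Join knowledge units (e.g. KG triples or table rows) and a question into one prompt."""
    knowledge = "\n".join(units)
    return f"Knowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"


def add_noise(positive: Sequence[str], noise_pool: Sequence[str],
              n_noise: int, rng: random.Random) -> List[str]:
    """Noise robustness: mix n_noise irrelevant units in with the units that support the answer."""
    units = list(positive) + rng.sample(list(noise_pool), n_noise)
    rng.shuffle(units)
    return units


def reorder(units: Sequence[str], rng: random.Random) -> List[str]:
    """Order insensitivity: return the same units in a random order."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return shuffled


def is_order_insensitive(model: Callable[[str], str], question: str,
                         units: Sequence[str], n_trials: int = 5, seed: int = 0) -> bool:
    """Check that the answer is unchanged across n_trials random orderings of the units."""
    rng = random.Random(seed)
    reference = model(build_prompt(question, units))
    return all(
        model(build_prompt(question, reorder(units, rng))) == reference
        for _ in range(n_trials)
    )


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: answers only if the supporting triple is present."""
    return "Paris" if "(France, capital, Paris)" in prompt else "unknown"
```

In a real evaluation `toy_model` would be replaced by a call to the LLM under test. Passing only noise units to `build_prompt` gives a simple negative-rejection probe, where the expected behaviour is a refusal such as "unknown".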
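Takeaways, Limitations
• Takeaways:
◦ Provides a comprehensive and rigorous benchmark for evaluating how well LLMs understand structured knowledge.
◦ Covers multiple types of structured knowledge, which allows LLM weaknesses to be diagnosed precisely.
◦ Enables fine-grained analysis of LLMs' structured knowledge understanding.
◦ Clearly exposes the limits of existing LLMs' structured knowledge understanding.
• Limitations:
◦ The range of LLMs included in the benchmark may currently be limited.
◦ The evaluation metrics and measurement methods used in SKA-Bench may need further study.
◦ The benchmark may be biased toward particular types of structured knowledge.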