Daily Arxiv
This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for the papers belongs to their authors and affiliated institutions; please cite the source when sharing.
AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants
Distributional Soft Actor-Critic with Diffusion Policy
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Fast AI Model Splitting over Edge Networks
From Sentences to Sequences: Rethinking Languages in Biological System
MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentation in 4D Ultrasound
Horus: A Protocol for Trustless Delegation Under Uncertainty
Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration
Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
Distinguishing Predictive and Generative AI in Regulation
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Text-Aware Image Restoration with Diffusion Models
How Good LLM-Generated Password Policies Are?
Towards an Explainable Comparison and Alignment of Feature Embeddings
Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification
Empowering Intelligent Low-altitude Economy with Large AI Model Deployment
Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Threat Modeling for AI: The Case for an Asset-Centric Approach
SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings
PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification
Significativity Indices for Agreement Values
Transferrable Surrogates in Expressive Neural Architecture Search Spaces
Privacy-Preserving Operating Room Workflow Analysis using Digital Twins
Uncertainty-Guided Coarse-to-Fine Tumor Segmentation with Anatomy-Aware Post-Processing
CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition
Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Understanding-informed Bias Mitigation for Fair CMR Segmentation
HAPI: A Model for Learning Robot Facial Expressions from Human Preferences
MaizeField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel
Illuminant and light direction estimation using Wasserstein distance method
Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
LLM-Powered Prediction of Hyperglycemia and Discovery of Behavioral Treatment Pathways from Wearables and Diet
Interleaved Gibbs Diffusion: Generating Discrete-Continuous Data with Implicit Constraints
EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Networks
Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference
Learning Traffic Anomalies from Generative Models on Real-Time Observations
Enabling Population-Level Parallelism in Tree-Based Genetic Programming for Comprehensive GPU Acceleration
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Quantifying the Importance of Data Alignment in Downstream Model Performance
Quantum-enhanced causal discovery for a small number of samples
On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
GeMID: Generalizable Models for IoT Device Identification
Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
Is Complex Query Answering Really Complex?
Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning
Offline Reinforcement Learning for Learning to Dispatch for Job Shop Scheduling
Reconsidering the energy efficiency of spiking neural networks
Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes
Sequence-aware Pre-training for Echocardiography Probe Movement Guidance
Anatomical Foundation Models for Brain MRIs
Learning From Crowdsourced Noisy Labels: A Signal Processing Perspective
Quantifying the Cross-sectoral Intersecting Discrepancies within Multiple Groups Using Latent Class Analysis Towards Fairness
Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Towards a Novel Measure of User Trust in XAI Systems
Avoiding Catastrophe in Online Learning by Asking for Help
Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning
Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data
Kernel Density Bayesian Inverse Reinforcement Learning
Embodied AI Agents: Modeling the World
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
AI Flow: Perspectives, Scenarios, and Approaches
A framework for Conditional Reasoning in Answer Set Programming
Autoformalization in the Era of Large Language Models: A Survey
Agentic AI Process Observability: Discovering Behavioral Variability
Artificial Intelligence Index Report 2025
MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science
XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation
Direct Preference Optimization Using Sparse Feature-Level Constraints
Unsupervised Cognition
Urban Region Pre-training and Prompting: A Graph-based Approach
Road Graph Generator: Mapping roads at construction sites from GPS data
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Subtyping in DHOL - Extended preprint
MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network
DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift
SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Multi-agent Auditory Scene Analysis
Fast and Simplex: 2-Simplicial Attention in Triton
Synthesizable by Design: A Retrosynthesis-Guided Framework for Molecular Analog Generation
Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics
Early Signs of Steganographic Capabilities in Frontier LLMs
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
APT: Adaptive Personalized Training for Diffusion Models with Limited Data
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data
Created by Haebom
Authors
Brando Miranda, Alycia Lee, Sudharsan Sundar, Allison Casasola, Rylan Schaeffer, Elyas Obbad, Sanmi Koyejo
Overview
This paper addresses how to quantitatively measure data quality, specifically data diversity, in large language model (LLM) pre-training. Prior LLM pre-training research has focused mainly on scaling model and dataset size, leaving data quality poorly defined. The authors propose a measure called the "diversity coefficient" to quantify the diversity of natural language data and use it to measure the diversity of publicly available pre-training datasets. Experiments on 44 models based on GPT-2 and LLaMAv2, spanning scales from 51M to 7B parameters, show that the proposed diversity coefficient correlates with downstream model evaluation performance. The authors conclude that the diversity coefficient is an important aspect of data quality and captures the causal link between data diversity and improved model performance.
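To make the metric concrete: the paper defines the diversity coefficient as the expected distance between Task2Vec embeddings of batches sampled from a dataset (each embedding being the diagonal Fisher Information of a probe network fine-tuned on that batch). The following is a minimal sketch of the final aggregation step only; the `diversity_coefficient` function and the random embeddings are illustrative assumptions, not the authors' code, and producing real Task2Vec embeddings is out of scope here.

```python
import numpy as np

def diversity_coefficient(batch_embeddings: np.ndarray) -> float:
    """Expected pairwise cosine distance between batch embeddings.

    Assumes `batch_embeddings` is an (n_batches, dim) array of
    Task2Vec-style vectors, one per sampled batch of the dataset.
    """
    # Normalize each embedding to unit length so dot products are cosines.
    norms = np.linalg.norm(batch_embeddings, axis=1, keepdims=True)
    normed = batch_embeddings / norms
    cosine_sims = normed @ normed.T
    # Average the cosine distance over distinct pairs (exclude the diagonal).
    n = len(normed)
    off_diagonal = ~np.eye(n, dtype=bool)
    return float((1.0 - cosine_sims[off_diagonal]).mean())

# Hypothetical usage: 200 batches embedded into a 100-dimensional space.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 100))
print(f"diversity coefficient ~ {diversity_coefficient(embeddings):.3f}")
```

Under this reading, a higher coefficient means batches drawn from the dataset look more dissimilar to one another, i.e., the data is more diverse; a near-zero value would indicate highly redundant data.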
Takeaways, Limitations
• Takeaways:
◦ Presents a new metric (the diversity coefficient) for quantitatively measuring the diversity of LLM pre-training data.
◦ Experimentally demonstrates that the diversity coefficient is closely related to LLM downstream task performance.
◦ Suggests a new direction for improving data quality.
◦ Shows consistent results across models of various sizes.
• Limitations:
◦ The diversity coefficient may not cover every aspect of data quality (factors other than diversity need to be considered).
◦ The results come from specific datasets and models, so further research on generalizability is needed.
◦ Computing the diversity coefficient can be expensive.
◦ Further research is needed on methods for constructing datasets that optimize the diversity coefficient.
View PDF