This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Created by
Haebom
Author
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Outline
This paper proposes REWIRE (REcycling the Web with guided REwrite), a method that recycles low-quality web text discarded by conventional filtering pipelines to address the "data wall" problem facing large-scale language models. REWIRE rewrites low-quality documents to make them useful for training, expanding the pre-training dataset by increasing the proportion of synthetic data. Experiments on DCLM benchmarks at the 1B, 3B, and 7B scales show that models trained on a mixture of high-quality raw text and rewritten text outperform models trained on filtered web data alone by 1.0, 1.3, and 2.5 percentage points, respectively, and perform better than models trained on twice the amount of web data. Analysis shows that approximately 82% of the mixed text derives from previously discarded low-quality documents, and that REWIRE outperforms other synthetic data generation methods such as Wikipedia-style paraphrasing, question-answer synthesis, and knowledge extraction. Reusing web text thus offers a simple and effective way to scale pre-training data. The high-quality synthetic data are available at https://huggingface.co/datasets/facebook/recycling_the_web .
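The recycle-then-mix idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `quality_score` stands in for a learned quality classifier and `guided_rewrite` stands in for LLM-guided rewriting; both are hypothetical placeholders.

```python
def quality_score(doc: str) -> float:
    """Toy proxy for a quality filter: longer, punctuated text scores higher.
    (Placeholder for a real learned classifier such as a fastText filter.)"""
    if not doc:
        return 0.0
    words = doc.split()
    return min(1.0, len(words) / 20) * (1.2 if doc.strip().endswith(".") else 0.8)

def guided_rewrite(doc: str) -> str:
    """Placeholder for an LLM prompted to rewrite a low-quality document."""
    return doc.strip().capitalize().rstrip(".") + "."

def recycle(corpus, threshold=0.5):
    """Keep high-quality docs as-is; rewrite the rest instead of discarding them.
    The returned list is the training mix of raw and rewritten text."""
    kept, rewritten = [], []
    for doc in corpus:
        if quality_score(doc) >= threshold:
            kept.append(doc)
        else:
            rewritten.append(guided_rewrite(doc))
    return kept + rewritten

corpus = [
    "This is a well-formed sentence about language models and their training data.",
    "cheap watches BUY NOW click here",
]
mix = recycle(corpus)  # both documents end up in the mix: one raw, one rewritten
```

The key design point is that the filter's rejects are routed to a rewriter rather than dropped, which is how the method recovers training signal from documents that filtering alone would waste.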
•
Takeaways:
◦
We propose a potential solution to the problem of securing the data required for large-scale language model training by reusing low-quality web data discarded by existing filtering processes.
◦
We experimentally demonstrate that the REWIRE technique can effectively generate synthetic data to expand the size of the pre-training dataset and improve model performance.
◦
We demonstrate the effectiveness of REWIRE by outperforming other existing synthetic data generation methods.
◦
We make the generated high-quality synthetic datasets publicly available to enable other researchers to utilize them.
•
Limitations:
◦
REWIRE's performance improvements were measured on a specific benchmark (DCLM) and specific model sizes, so the same gains are not guaranteed on other benchmarks or at other model scales.
◦
There is a lack of in-depth analysis of the biases and errors that can be introduced when rewriting low-quality data into high-quality data.
◦
There is a lack of analysis on the computational costs of the data reuse process. Further research is needed to determine how cost-effective the rewriting process itself is compared to the cost of pretraining.