Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ByteGen: A Tokenizer-Free Generative Model for Orderbook Events in Byte Space

Created by
  • Haebom

Authors

Yang Li, Zhi Chen

Outline

This paper presents ByteGen, a novel generative model for the challenging problem of modeling high-frequency limit order book (LOB) dynamics. Existing approaches either rely on simplified probabilistic assumptions or, in the case of modern deep learning models such as Transformers, depend on tokenization techniques that compromise the high-precision numerical properties of the data. ByteGen overcomes these limitations by operating directly on the raw byte stream of LOB events. To represent market messages without information loss, the authors design a 32-byte compressed binary format and frame the problem as an autoregressive next-byte prediction task. Because feature engineering and tokenization are eliminated entirely, the model learns market dynamics from the most basic representation of the data. ByteGen adopts the H-Net architecture, whose dynamic chunking mechanism discovers the inherent structure of market messages without predefined rules. Trained on over 34 million events from CME Bitcoin futures, the model reproduces key features of financial markets, including realistic price distributions, heavy-tailed returns, and bursty event timing.
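The fixed-width record and the next-byte prediction setup described above can be illustrated with a minimal sketch. The field layout below (timestamp, order ID, price in ticks, size, event type, side, two padding bytes) is purely hypothetical and chosen only so the record comes out at 32 bytes; the paper's actual format is not specified in this summary. The helper names `pack_event`, `unpack_event`, and `next_byte_pairs` are likewise illustrative assumptions.

```python
import struct

# HYPOTHETICAL layout, not the paper's format. Little-endian, no alignment:
#   Q  timestamp_ns  (8 bytes, unsigned)
#   Q  order_id      (8 bytes, unsigned)
#   q  price_ticks   (8 bytes, signed integer price, so no float rounding)
#   I  size          (4 bytes, unsigned)
#   B  event_type    (1 byte)
#   B  side          (1 byte)
#   2x padding       (2 bytes)
EVENT_FORMAT = "<QQqIBB2x"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)  # 8 + 8 + 8 + 4 + 1 + 1 + 2 = 32


def pack_event(timestamp_ns, order_id, price_ticks, size, event_type, side):
    """Serialize one order book event into a fixed 32-byte record."""
    return struct.pack(
        EVENT_FORMAT, timestamp_ns, order_id, price_ticks, size, event_type, side
    )


def unpack_event(record):
    """Recover the six integer fields from a 32-byte record."""
    return struct.unpack(EVENT_FORMAT, record)


def next_byte_pairs(stream, context_len):
    """Yield (context, target) pairs for autoregressive next-byte prediction.

    Each context is a window of `context_len` bytes and the target is the
    integer value (0..255) of the byte that immediately follows it.
    """
    for i in range(context_len, len(stream)):
        yield stream[i - context_len:i], stream[i]


# Two made-up events concatenated into one byte stream.
first = pack_event(1_700_000_000_000_000_000, 1, 6_500_000, 3, 0, 1)
second = pack_event(1_700_000_000_000_000_250, 2, 6_500_025, 1, 0, 0)
stream = first + second
pairs = list(next_byte_pairs(stream, context_len=EVENT_SIZE))
```

Storing prices as integer ticks keeps the round trip exact, which is the sense in which a byte-level record avoids the precision loss that numeric tokenization can introduce. A model trained on such pairs only ever predicts one of 256 byte values at each step.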

Takeaways, Limitations

Takeaways:
Presents the first end-to-end byte-level framework for LOB modeling.
Proposes an efficient compressed data representation.
Achieves competitive performance on standard market quality metrics without tokenization bias.
Demonstrates that learning directly in byte space is a promising and flexible paradigm for modeling complex financial systems.
Limitations:
Results are presented only for CME Bitcoin futures data; further research is needed to determine whether the approach generalizes to other assets or markets.
The dynamic chunking mechanism of the H-Net architecture is not described in detail, so reproducibility remains to be verified.
The model's scalability and computational cost are not analyzed.