This paper proposes BitDecoding, a novel inference system that leverages low-bit KV caches to address the growing memory and bandwidth demands of long-context large language model (LLM) inference. BitDecoding enables efficient decoding over low-bit KV caches by cooperatively using CUDA cores and Tensor Cores: it automatically derives layouts optimized for Tensor Core utilization and performs dequantization with warp-level parallelization strategies. It also provides unified system support, comprising a query transformation module that handles various attention variants, a high-performance quantization kernel that supports both tensor-wise and channel-wise scaling as used by different quantization algorithms, and a dequantization kernel with a software-defined pipeline that coordinates execution across CUDA cores and Tensor Cores. Evaluations on RTX 4090, A100, and H100 show that BitDecoding delivers up to 7.5x, 4.8x, and 8.9x decoding speedups, respectively, over FP16 FlashDecoding-v2, and outperforms the state-of-the-art low-bit KV cache system QServe by up to 4.3x. Significant improvements are also reported for long-context generation, including up to a 3x reduction in single-batch decoding latency on LLaMA-3.1-8B with a 128K context. The code is available on GitHub.
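As a rough illustration of the tensor-wise versus channel-wise scaling that the quantization kernel supports, the NumPy sketch below quantizes a KV-cache block to 4 bits under both schemes and compares reconstruction error. The function names, tensor shapes, and asymmetric min/max scheme here are illustrative assumptions, not BitDecoding's actual kernel interface, which operates on packed GPU tiles.

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 4, per_channel: bool = True):
    """Asymmetric low-bit quantization of a KV-cache block (tokens x channels).

    per_channel=True  -> channel-wise scaling (one scale/zero per channel)
    per_channel=False -> tensor-wise scaling (one scale/zero for the whole block)
    """
    qmax = (1 << bits) - 1
    axis = 0 if per_channel else None            # reduce over tokens per channel, or over everything
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard against constant channels
    zero = xmin
    q = np.clip(np.round((x - zero) / scale), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_kv(q, scale, zero):
    """Reconstruct FP values; on GPU this happens per tile just before the Tensor Core matmul."""
    return q.astype(np.float32) * scale + zero

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic K-cache block: 256 cached tokens x 128 head channels, with a few outlier channels.
    k = rng.normal(size=(256, 128)).astype(np.float32)
    k[:, :4] *= 20.0                             # large-magnitude channels, common in KV caches

    for per_channel in (False, True):
        q, s, z = quantize_kv(k, bits=4, per_channel=per_channel)
        err = np.abs(dequantize_kv(q, s, z) - k).mean()
        print(f"per_channel={per_channel}: mean abs error = {err:.4f}")
```

On such data, channel-wise scaling typically yields a much lower reconstruction error than tensor-wise scaling because outlier channels no longer dominate the shared scale, which is why the system supports both granularities.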