This paper addresses the limitations of the Least-Recently-Used (LRU) cache replacement policy used in web proxies such as NGINX and proposes Cold-RL, a novel reinforcement learning-based replacement policy. Cold-RL serves a dueling Deep Q-Network (DQN) from an ONNX sidecar to make replacement decisions under a strict latency budget of 500 microseconds. Six lightweight features are extracted from each cached object to select eviction targets: age, size, hit count, inter-arrival time, remaining TTL, and last origin RTT. Training is performed offline in a simulation environment that replays NGINX access logs. Experimental results show that Cold-RL outperforms LRU, LFU, size-based, adaptive-LRU, and hybrid baselines across a range of cache sizes: it achieves a 146% improvement in hit ratio at a small cache size (25 MB) and performs on par with the strongest baselines at a large cache size (400 MB). Inference adds less than 2% CPU overhead, and the 95th-percentile decision latency stays within the 500 microsecond budget.
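As a rough illustration of the kind of model described, the sketch below shows a minimal dueling Q-network in PyTorch that scores eviction candidates from the six per-object features. The layer sizes, the keep-vs-evict action encoding, the feature names and ordering, and the victim-selection rule are assumptions made for illustration; they are not taken from the paper, which exports its trained network to ONNX for sidecar serving.

```python
import torch
import torch.nn as nn

# Assumed feature names and ordering for the six per-object inputs (illustrative only).
FEATURES = ["age", "size", "hit_count", "inter_arrival", "remaining_ttl", "last_origin_rtt"]

class DuelingDQN(nn.Module):
    """Minimal dueling Q-network sketch: a shared trunk with separate value and
    advantage heads, combined as Q = V + (A - mean(A))."""
    def __init__(self, n_features: int = 6, hidden: int = 32, n_actions: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        v = self.value(h)          # state value, shape (batch, 1)
        a = self.advantage(h)      # per-action advantages, shape (batch, n_actions)
        return v + a - a.mean(dim=-1, keepdim=True)

if __name__ == "__main__":
    # Score a batch of eviction candidates: one feature row per cached object.
    # Action 0 = keep, action 1 = evict (an assumed encoding).
    net = DuelingDQN()
    candidates = torch.rand(8, len(FEATURES))   # 8 objects, 6 features each
    with torch.no_grad():
        q = net(candidates)                     # shape (8, 2)
    evict_scores = q[:, 1] - q[:, 0]            # higher => more attractive eviction target
    victim = int(torch.argmax(evict_scores))
    print(f"evict candidate index: {victim}")
```

In a deployment like the one the paper describes, a trained network of this shape could be exported (for example via torch.onnx.export) and served by an ONNX runtime sidecar so that each eviction decision fits within the microsecond-scale budget; the export and serving details here are assumptions, not the paper's implementation.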