This paper presents a novel method for improving the complex problem-solving ability of large language models (LLMs) through reinforcement learning. Conventional reinforcement learning requires verifiable reward signals, which are costly or impractical to obtain in many domains. This study demonstrates that LLMs can exploit the asymmetry between generation and verification to judge their own outputs and improve without reference solutions. Implementing this self-judgment on countdown puzzles and integration problems, we achieve performance comparable to training with conventional ground-truth verification. In particular, the DeepSeek-distilled Qwen 2.5 7B model trained with self-rewards reaches performance comparable to that achieved in the MIT Integration Bee competition. Combined with synthetic problem generation, we establish a complete self-improvement loop in which the model generates, solves, and evaluates its own problems. These results show that reinforcement learning can be extended to many domains previously limited by the difficulty of reward design, a significant step toward autonomous AI systems that continuously improve through self-directed learning without human intervention.
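To make the generate-solve-judge loop concrete, the sketch below illustrates one plausible structure for it. This is not the paper's implementation: all names (`sample_text`, `self_judge`, `collect_episodes`, `Episode`) are hypothetical placeholders, and the actual policy-gradient update is omitted because it depends on the training framework used.

```python
# Minimal sketch of a self-rewarding loop: the model writes a problem,
# attempts it, and grades its own attempt; the grade serves as the RL reward.
# All functions here are illustrative stubs, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Episode:
    problem: str
    solution: str
    reward: float


def sample_text(prompt: str) -> str:
    """Placeholder for sampling from the policy model (e.g. a 7B LLM)."""
    return f"<model output for: {prompt[:40]}...>"


def self_judge(problem: str, solution: str) -> float:
    """Placeholder self-judgment: the same model scores its own solution,
    returning a reward in [0, 1]. The premise is that verifying a candidate
    solution is easier than producing one, which is what makes this signal
    usable without a reference answer."""
    verdict = sample_text(
        f"Problem: {problem}\nProposed solution: {solution}\n"
        f"Is this solution correct? Answer YES or NO."
    )
    return 1.0 if "YES" in verdict.upper() else 0.0


def collect_episodes(n: int) -> list[Episode]:
    episodes = []
    for _ in range(n):
        # 1. The model writes its own training problem (synthetic generation).
        problem = sample_text("Write a challenging integration problem.")
        # 2. The model attempts to solve it.
        solution = sample_text(f"Solve step by step: {problem}")
        # 3. The model grades its own attempt; the grade is the RL reward.
        reward = self_judge(problem, solution)
        episodes.append(Episode(problem, solution, reward))
    return episodes


if __name__ == "__main__":
    batch = collect_episodes(4)
    # 4. A policy-gradient step (e.g. PPO/GRPO) would consume these rewards;
    #    it is omitted here as it depends on the training setup.
    print(f"mean self-assigned reward: {sum(e.reward for e in batch) / len(batch):.2f}")
```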