This paper reveals a trade-off between generating unit test inputs that reveal errors in faulty code and correctly predicting the expected unit test outputs without access to a gold solution. To address this trade-off, we propose UTGen, which trains LLMs to generate error-revealing unit test inputs together with their correct expected outputs from a task description. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we further propose UTDebug, which improves UT output prediction by scaling test-time compute, and which validates and backtracks edits against multiple generated UTs to avoid overfitting, enabling LLMs to debug effectively. Experimental results show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When combined with UTDebug, UTGen's feedback improves the pass@1 accuracy of Qwen2.5 32B by 3.17% and 12.35%, respectively, on the more challenging debugging splits of HumanEvalFix and MBPP+, compared to other LLM-based UT generation baselines. Moreover, feedback from a UTGen model based on Qwen2.5 32B improves the debugging performance of state-of-the-art LLMs, such as GPT-4o, by 13.8%. Finally, when judging code correctness with best-of-10 sampling using Qwen2.5 7B on HumanEval+, UTGen outperforms a state-of-the-art 8B reward model by 4.43%.
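To make the validate-and-backtrack idea concrete, the sketch below shows one plausible reading of the debugging loop described above: an edit is kept only if it passes at least as many generated UTs as the current code, and is otherwise discarded (backtracked). All names here (`utdebug_loop`, `propose_edit`, `passes`, `vote_output`) are hypothetical placeholders, not the paper's actual implementation, and reading "test-time compute" as a self-consistency-style majority vote over sampled outputs is an assumption.

```python
from collections import Counter
from typing import Callable, List, Tuple

UnitTest = Tuple[str, str]  # (test input, model-predicted expected output)


def vote_output(sampled_outputs: List[str]) -> str:
    """Test-time-compute sketch (assumption): sample several candidate outputs
    for a UT input and keep the most frequent one (self-consistency-style vote)."""
    return Counter(sampled_outputs).most_common(1)[0][0]


def utdebug_loop(
    code: str,
    unit_tests: List[UnitTest],
    propose_edit: Callable[[str, List[UnitTest]], str],  # LLM debugger call (placeholder)
    passes: Callable[[str, UnitTest], bool],             # executes code on one UT (placeholder)
    max_rounds: int = 5,
) -> str:
    """Validate-and-backtrack debugging over model-generated unit tests:
    keep an edit only if it passes more generated UTs than the current code;
    otherwise discard it. Illustrative sketch based on the abstract, not the paper's code."""
    best_code = code
    best_score = sum(passes(best_code, ut) for ut in unit_tests)
    for _ in range(max_rounds):
        if best_score == len(unit_tests):
            break  # all generated UTs pass; nothing left to debug against
        failing = [ut for ut in unit_tests if not passes(best_code, ut)]
        candidate = propose_edit(best_code, failing)      # edit guided by failing-UT feedback
        cand_score = sum(passes(candidate, ut) for ut in unit_tests)
        if cand_score > best_score:                       # validate against all generated UTs
            best_code, best_score = candidate, cand_score
        # else: backtrack -- the edit likely overfits to noisy tests, so keep best_code unchanged
    return best_code
```

Checking candidates against the full set of generated UTs, rather than only the ones quoted in the feedback, is what guards against overfitting to a single (possibly noisy) test.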