This paper proposes a general framework for speech transformation that offers robust control over, and interpretability of, attribute manipulation. Unlike existing empirical approaches to voice style conversion, this study provides theoretical analysis and guarantees. The framework is built on a non-probabilistic autoencoder architecture and imposes an independence constraint between the predicted latent variables and the controllable target variables. This design enables consistent signal transformation and targeted attribute modification conditioned on observed style variables while preserving the original content. Experiments on several voice styles, including speaker identity and emotion, demonstrate the effectiveness and generality of the proposed method.
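The abstract does not specify how the independence constraint is enforced. As a minimal illustrative sketch (not the authors' method), one common proxy for independence is a cross-covariance penalty that drives the empirical covariance between the content latent `z` and the observed style variables `s` toward zero, added to the autoencoder's reconstruction loss:

```python
import numpy as np

def cross_covariance_penalty(z, s):
    """Squared Frobenius norm of the empirical cross-covariance
    between latent codes z (N x d_z) and style variables s (N x d_s).
    A value near zero indicates z and s are (linearly) decorrelated."""
    zc = z - z.mean(axis=0)
    sc = s - s.mean(axis=0)
    cov = zc.T @ sc / (len(z) - 1)
    return float(np.sum(cov ** 2))

# Toy check: a latent that depends on style is penalized more heavily
# than an independent one.
rng = np.random.default_rng(0)
s = rng.normal(size=(256, 2))            # observed style variables
z_dependent = s @ rng.normal(size=(2, 4))  # latent leaking style information
z_independent = rng.normal(size=(256, 4))  # latent independent of style

print(cross_covariance_penalty(z_dependent, s) >
      cross_covariance_penalty(z_independent, s))
```

In training, such a penalty would be weighted and added to the reconstruction loss, encouraging the encoder to strip style information from the content code so that style can be injected at the decoder. Note that zero cross-covariance only rules out linear dependence; the paper's theoretical guarantees presumably rest on a stronger formal notion of independence.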