This paper addresses hand-object 3D reconstruction, an increasingly important task in applications such as human-robot interaction and immersive AR/VR experiences. Conventional object-agnostic approaches to hand-object reconstruction from RGB sequences follow a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques such as structure-from-motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weakly textured surfaces, and mutual hand-object occlusion, limiting their scalability and generalizability. As a key step toward general, smooth, and non-intrusive applicability, this study proposes a robust, keypoint-detector-free approach to estimating hand-object 3D transformations from monocular motion video. By integrating this approach with a multi-view reconstruction pipeline, we further recover accurate hand-object 3D shape. Our method, named HOSt3R, is unconstrained: it relies neither on pre-scanned object templates nor on internal camera parameters. It achieves state-of-the-art performance on the SHOWMe benchmark for object-agnostic hand-object 3D transformation and shape estimation. We also demonstrate generalization to unseen object categories through experiments on sequences from the HO3D dataset.