Markerless hand–eye calibration is crucial for estimating precise transformations between optical sensors and robots, especially in unstructured environments. Monocular cameras, however, despite being cost-effective and computationally lightweight, pose a challenge because projected image coordinates alone do not provide complete 3D correspondences. This work proposes a hand–eye calibration method for the eye-to-hand configuration that uses rotation representations inferred by an enhanced autoencoder neural network. The approach combines relevant visual features, known three-dimensional geometry, and projected information to predict the robot's position and orientation from monocular RGB images, eliminating the need for physical markers. The solution analyzes the latent spatial vectors generated during autoencoding, significantly improving orientation estimation over traditional machine learning models. The method is computationally efficient, operates in real time, and is robust to occlusions and lighting variations. It was validated in both simulated and real-world environments using an RGB-D sensor, with translation and orientation errors evaluated by registering predicted and captured point clouds. The method outperformed traditional chessboard-marker-based techniques, showing greater precision and adaptability. Although it faces challenges such as ambiguous features and perspective distortions, these can be mitigated by refining predictions and integrating depth data. Additionally, its ability to identify multiple robots enables simultaneous calibrations, expanding its applicability in dynamic environments.
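
As an illustration of the evaluation protocol summarized above, the minimal sketch below registers a predicted point cloud against a captured one and derives translation and rotation errors from the residual transform. Open3D, the ICP settings, and the helper name `registration_errors` are assumptions made for illustration only, not the paper's implementation.

```python
import numpy as np
import open3d as o3d


def registration_errors(predicted_pts, captured_pts, max_corr_dist=0.02):
    """Align a predicted cloud to a captured cloud with ICP and report the
    residual translation (same units as the inputs) and rotation (degrees).

    Illustrative sketch only; parameter values are assumptions.
    """
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(np.asarray(predicted_pts, dtype=float))
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(np.asarray(captured_pts, dtype=float))

    # Point-to-point ICP starting from the identity; the resulting transform
    # measures how far the predicted cloud is from the captured one.
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())

    T = result.transformation
    translation_error = np.linalg.norm(T[:3, 3])

    # Rotation error expressed as the angle of the residual rotation matrix.
    cos_angle = np.clip((np.trace(T[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rotation_error_deg = np.degrees(np.arccos(cos_angle))
    return translation_error, rotation_error_deg
```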