Optical and acoustic stereo imaging has great potential for the precise and consistent localization of underwater intervention robots; however, it remains largely unexplored because of its sensing limitations and various technical challenges. This study presents a novel localization method that combines an inertial navigation system with an optical and acoustic stereo imaging system. To correct the localization relative to underwater structures, two strategies are used: for mid-range localization, the robot’s pose is estimated from a single acoustic image using a sonar simulator, and for high-precision localization near the target structures, robust visual tracking with a 3-D wireframe model is employed. The proposed technique was validated experimentally using real data obtained in a test tank.