This paper presents a system for dynamic 3D scene perception in battlefield environments, built around unmanned intelligent agents equipped with binocular vision and inertial measurement units (IMUs). The system processes binocular video streams and IMU data using deep learning techniques, including instance segmentation and dense optical flow prediction, supported by a specially curated target dataset. With a ResNet101+FPN backbone, the trained model achieves a combat unit type recognition accuracy of 91.8%, a mean Intersection over Union (mIoU) of 0.808, and a mean Average Precision (mAP) of 0.6064. A dynamic scene localization and perception module uses these deep learning outputs to refine pose estimates and improve localization accuracy, mitigating the environmental complexity and motion-induced errors typically associated with SLAM methods. Application tests in a simulated battlefield metaverse environment demonstrate a 44.2% improvement in self-localization accuracy over the traditional ORB-SLAM2 Stereo method. The system tracks and annotates dynamic and static battlefield elements, continuously updating global maps with precise data on agent poses and target movements. This work not only addresses the dynamic complexity and potential information loss of battlefield scenarios but also lays a foundational framework for future improvements in network capability and environment reconstruction. Future work will focus on precise identification of combat unit models, multi-agent collaboration, and the application of 3D scene perception to real-time decision-making and tactical planning in joint combat scenarios. This approach holds significant potential for enriching the battlefield metaverse, fostering deep human-machine interaction, and guiding practical military applications.