In this paper, we introduce a temporal consistency enhancement approach that adapts image-based models to video data by leveraging a deep-learning-based Kalman filter. Specifically, we propose a novel bi-directional Kalman filter strategy that processes each sequence both forward and backward to capitalize on the higher-quality pose estimates obtained near the camera, improving the robustness and precision of vehicle tracking across varying distances and conditions. In addition, rather than using a conventional mathematical motion model, we propose a learnable motion model, dubbed the Future State Predictor, to represent the complex, non-linear motion patterns observed in vehicles. Experimental results demonstrate that our approach improves both pose accuracy and temporal consistency, which allows it to handle challenging occluded and distant vehicles.
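To make the forward and backward processing idea concrete, the following is a minimal sketch under simplifying assumptions, not the implementation used in this paper. It tracks a single scalar pose value, uses an identity (random-walk) predictor as a stand-in for the learned Future State Predictor, and fuses the two passes with a simple inverse-variance heuristic. All function and parameter names (`kalman_pass`, `bidirectional_filter`, `predict_fwd`, `predict_bwd`, `process_var`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def identity(x: float) -> float:
    """Placeholder motion model; a learned predictor could be plugged in here."""
    return x


def kalman_pass(
    measurements: Sequence[float],
    meas_vars: Sequence[float],
    predict: Callable[[float], float],
    process_var: float,
) -> List[Tuple[float, float]]:
    """Scalar Kalman filter over the measurements in the order given.

    Returns one (estimate, variance) pair per measurement.
    """
    # Initialise the state from the first measurement and its variance.
    x, p = measurements[0], meas_vars[0]
    out = [(x, p)]
    for z, r in zip(measurements[1:], meas_vars[1:]):
        # Predict step. Assumption: the variance grows by process_var only,
        # i.e. the predictor is treated as having a unit Jacobian.
        x_pred = predict(x)
        p_pred = p + process_var
        # Update step with the standard scalar Kalman gain.
        k = p_pred / (p_pred + r)
        x = x_pred + k * (z - x_pred)
        p = (1.0 - k) * p_pred
        out.append((x, p))
    return out


def bidirectional_filter(
    measurements: Sequence[float],
    meas_vars: Sequence[float],
    predict_fwd: Callable[[float], float] = identity,
    predict_bwd: Callable[[float], float] = identity,
    process_var: float = 0.1,
) -> List[float]:
    """Run a forward and a backward pass, then fuse them per frame."""
    fwd = kalman_pass(measurements, meas_vars, predict_fwd, process_var)
    # The backward pass starts from the last frame, e.g. where the vehicle
    # is closest to the camera and the measurement is most reliable.
    bwd = kalman_pass(
        list(reversed(measurements)),
        list(reversed(meas_vars)),
        predict_bwd,
        process_var,
    )[::-1]
    fused = []
    for (xf, pf), (xb, pb) in zip(fwd, bwd):
        # Inverse-variance weighting: the lower-variance pass dominates.
        # This is a heuristic; both passes have already used the current
        # measurement, so it is not an exact two-filter smoother.
        w = pb / (pf + pb)
        fused.append(w * xf + (1.0 - w) * xb)
    return fused


# Example: a vehicle approaching the camera, so measurement noise shrinks.
estimates = bidirectional_filter([10.0, 5.0, 5.0], [100.0, 1.0, 1.0])
```

In this example the first, distant frame has a very noisy measurement; because the backward pass reaches it after seeing the reliable near-camera frames, the fused estimate for that frame is pulled toward the values supported by the later frames.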