SPIE Journal Paper | 10 August 2024
Shengxue Cao, Biao Zhu, Keyan Kong, Lixiang Ma, Bingyou Liu
KEYWORDS: Performance modeling, Optical tracking, Education and training, Data modeling, Video, Detection and tracking algorithms, Network architectures, Deformation, Ablation, Head
A key challenge in video tracking is exploiting temporal correlation to handle complex tracking scenes without adding computational cost. Unlike trackers that rely on online template updates, we introduce SiamGPF, a concise and efficient Siamese tracker that assimilates temporal information through an offline strategy. Because a target's appearance can vary considerably over time, relying solely on static template information is inadequate under severe deformation and occlusion. We therefore integrate the target's historical states into the framework as pseudo-prior information, fusing features through global and local cross-correlation. We further introduce a state fusion module that strengthens the interaction of target information across different time steps. At inference, a low-cost dynamic prior-frame selector discards low-quality target states, substantially improving the tracker's robustness. SiamGPF achieves competitive performance on prominent tracking benchmarks (OTB100, LaSOT, LaSOTExt, UAV123, NFS, TC128, GOT-10k, and VOT2016), and the official GOT-10k server evaluation reports a running speed of 212 FPS.
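The abstract names two correlation operators (global and local cross-correlation) and an inference-time prior-frame selector. The sketch below shows, in PyTorch, how such components are commonly realized in Siamese trackers; it is not the authors' SiamGPF code, and the function names, tensor shapes, and confidence threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def global_xcorr(z, x):
    """Global cross-correlation: each template feature map z_i acts as one
    dense kernel over its own search feature map x_i, yielding a single
    similarity map per sample (SiamFC-style).
    z: (B, C, Hz, Wz), x: (B, C, Hx, Wx) -> (B, 1, Ho, Wo)."""
    b, c, hz, wz = z.shape
    x = x.reshape(1, b * c, x.size(2), x.size(3))
    out = F.conv2d(x, z, groups=b)                      # weight: (B, C, Hz, Wz)
    return out.reshape(b, 1, out.size(2), out.size(3))

def local_xcorr(z, x):
    """Local (depthwise) cross-correlation: correlate channel by channel,
    keeping per-channel spatial detail (SiamRPN++-style).
    z: (B, C, Hz, Wz), x: (B, C, Hx, Wx) -> (B, C, Ho, Wo)."""
    b, c, hz, wz = z.shape
    x = x.reshape(1, b * c, x.size(2), x.size(3))
    out = F.conv2d(x, z.reshape(b * c, 1, hz, wz), groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))

class PriorFrameSelector:
    """Inference-time selector: accept a new frame as the pseudo-prior only
    if the tracker's peak response is confident enough; otherwise keep the
    last reliable prior, so low-quality target states never enter the
    pipeline. The 0.6 threshold is an assumed hyperparameter."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.prior = None                               # cached prior-frame features

    def update(self, frame_feat, response):
        # Peak classification score per sample as a cheap quality proxy.
        score = response.flatten(1).max(dim=1).values
        if self.prior is None or score.mean().item() >= self.threshold:
            self.prior = frame_feat                     # adopt as new pseudo-prior
        return self.prior                               # unreliable frames are skipped
```

In this style of design, the two correlation outputs would typically be concatenated or combined by a learned fusion head (the paper's state fusion module); the selector adds only a max-reduction per frame, which is consistent with the abstract's claim of temporal modeling at negligible extra cost.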