Offline reinforcement learning is a pivotal area within the broader field of reinforcement learning. Its central objective is to train an agent exclusively from previously collected behavioral data, eliminating any need for online interaction. However, relying solely on an offline dataset often leads to ineffective solutions, primarily due to the mismatch between the learned policy and the actual underlying environment. Recent research has tended to approach this challenge with an overly pessimistic mindset, potentially compromising the agent's robustness when encountering unseen states. We introduce a self-supervised framework tailored to mitigate this issue. Drawing inspiration from contrastive techniques in self-supervised learning, we treat the original data as positive samples and generate synthetic data from highly uncertain regions as negative samples. To simulate these regions, we employ modified Generative Adversarial Networks (GANs) to produce samples that mirror the distribution of previous experiences while introducing a significant degree of uncertainty with respect to the behavior policy. To bolster the policy's robustness, we penalize overconfident behavior on this negative data. Comprehensive experiments on multiple public offline reinforcement learning benchmarks demonstrate the practicality and efficacy of our framework.
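To make the idea concrete, the following is a minimal sketch of the described scheme: a GAN generator proposes synthetic (negative) state-action pairs from uncertain regions, and the critic update fits the offline (positive) data while penalizing overconfident Q-values on the negatives. This is an illustration under assumptions, not the authors' implementation; all module names, the simplified SARSA-style target, and the penalty weight are hypothetical choices made for brevity.

```python
# Illustrative sketch only: positive samples come from the offline dataset,
# negative samples come from a GAN generator, and large Q-values on negatives
# are penalized. Names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

state_dim, action_dim, noise_dim = 17, 6, 32  # example dimensions

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

q_net = mlp(state_dim + action_dim, 1)               # critic Q(s, a)
generator = mlp(noise_dim, state_dim + action_dim)   # GAN generator of synthetic (s, a) pairs
penalty_weight = 1.0                                  # assumed trade-off coefficient

def critic_loss(batch, gamma=0.99):
    """TD loss on offline (positive) data plus a penalty that pushes down
    Q-values on GAN-generated (negative) samples from uncertain regions."""
    s, a, r, s_next, a_next = batch  # tensors drawn from the offline dataset
    with torch.no_grad():
        # Simplified SARSA-style target using the dataset's next action.
        target = r + gamma * q_net(torch.cat([s_next, a_next], dim=-1))
    td_loss = ((q_net(torch.cat([s, a], dim=-1)) - target) ** 2).mean()

    z = torch.randn(s.shape[0], noise_dim)
    fake_sa = generator(z)                            # synthetic, high-uncertainty pairs
    overconfidence_penalty = q_net(fake_sa).mean()    # discourage large Q on negatives
    return td_loss + penalty_weight * overconfidence_penalty

# Usage example with random placeholder tensors.
batch = (torch.randn(8, state_dim), torch.randn(8, action_dim), torch.randn(8, 1),
         torch.randn(8, state_dim), torch.randn(8, action_dim))
loss = critic_loss(batch)
loss.backward()
```

In practice the generator would be trained adversarially against the offline data so that its outputs stay near the dataset distribution while remaining uncertain under the behavior policy, as described above.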