This paper provides a simple method for representation learning in an RL environment. Suppose an agent is deployed in an environment where states are given as images. The agent learns a representation of the state via contrastive learning, and also learns a model of the environment: given the current state and an action, it tries to predict the state that action leads to, in embedding space. The two parts are trained adversarially: the encoder and dynamics model minimize the contrastive loss, while a curiosity-based ``adversarial guide'' chooses actions to maximize it, steering the agent toward transitions its current representation explains poorly.
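To make the action-selection step concrete, here is a minimal numpy sketch under stated assumptions: a toy deterministic environment, a random (untrained) linear encoder and per-action dynamics model, and an InfoNCE-style contrastive loss where the predicted next embedding should score the true next state above the next states reachable via other actions. The names `encode`, `predict_next`, `step`, and `choose_action` are hypothetical, not from the paper; in the actual method these components are neural networks trained on the agent's experience.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STATE, D_EMB, N_ACTIONS = 8, 4, 3

# Random stand-ins for the learned networks (for illustration only).
W_enc = rng.normal(size=(D_EMB, D_STATE))           # state encoder
W_dyn = rng.normal(size=(N_ACTIONS, D_EMB, D_EMB))  # per-action latent dynamics

def encode(s):
    return W_enc @ s

def predict_next(z, a):
    return W_dyn[a] @ z

def step(s, a):
    # Hypothetical toy deterministic transition function.
    return np.roll(s, a + 1)

def contrastive_loss(s, a):
    """InfoNCE-style loss: the predicted embedding for action `a` should be
    most similar to the true next state's embedding (the positive), with the
    other actions' next states serving as negatives."""
    z_next = np.stack([encode(step(s, b)) for b in range(N_ACTIONS)])
    pred = predict_next(encode(s), a)
    logits = z_next @ pred
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[a]

def choose_action(s):
    """The adversarial guide: pick the action whose predicted transition the
    representation currently explains worst (highest contrastive loss)."""
    return int(np.argmax([contrastive_loss(s, a) for a in range(N_ACTIONS)]))
```

In training, the encoder and dynamics model would take gradient steps to shrink `contrastive_loss` on the transitions actually visited, while `choose_action` keeps pushing the agent toward the transitions where that loss is largest.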
This is pretty cool, simple, and similar in spirit to ViewMaker, except that the ``views'' here come from the actions available in the environment, whereas ViewMaker learns bounded image perturbations.