
KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

KAGE-Bench is a fast, JAX-native RL benchmark that isolates known-axis visual shifts to probe generalization in pixel-based RL.

¹MIRIAI   ²Cognitive AI Systems Lab

Demos (only part of what is in the benchmark!)

[15 animated demos of visual configurations]

Abstract

Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation reaches up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors.

Introduction

Pixel-based reinforcement learning agents often fail under purely visual distribution shifts, even when underlying dynamics and rewards remain unchanged. Existing benchmarks typically entangle multiple sources of variation, making it difficult to attribute failures to specific visual factors.

This work introduces a controlled, high-throughput framework that isolates visual shifts by construction, enabling precise and scalable analysis of visual generalization in reinforcement learning.

Visual Configurations

[Success-rate curves for distractor configurations 1, 3, and 5]

Examples of visual generalization gaps. Success rate for three train–evaluation pairs showing (left) negligible, (middle) moderate, and (right) severe generalization gaps.

KAGE-Env

KAGE-Env is a JAX-native 2D platformer environment in which the latent control problem is fixed while the observation process is factorized into independently controllable visual axes. Visual variation affects only the renderer, not dynamics or rewards, ensuring that performance changes arise solely from pixel-level perception.

The environment is fully vectorized and JIT-compiled, supports configuration via a single YAML file, and achieves up to 33 million environment steps per second on a single GPU.

Implementation Snapshots

KAGE-Env code snippet 1

Python (JAX) usage. The environment is configured from a .yaml file (e.g., custom_config.yaml); the snippet shows jax.vmap/jax.jit-batched reset/step over 216 parallel environments.
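The batched reset/step pattern referenced in the caption can be sketched with a toy example. This is NOT the KAGE-Env API; the state, dynamics, and reward here are illustrative placeholders showing only the vmap/jit vectorization idiom:

```python
# Minimal toy sketch of the vmap/jit batching pattern described above.
# This is NOT the KAGE-Env API; state, dynamics, and reward are
# illustrative placeholders.
import jax
import jax.numpy as jnp

def reset(key):
    # Toy per-environment state: a scalar x-position.
    del key  # a real env would use the key for randomized resets
    return jnp.zeros(())

def step(state, action):
    # Toy dynamics: position advances by the action; reward = new position.
    new_state = state + action
    return new_state, new_state

n_envs = 216  # matches the parallelism mentioned in the caption above
keys = jax.random.split(jax.random.PRNGKey(0), n_envs)

batched_reset = jax.jit(jax.vmap(reset))   # (n_envs,) states from (n_envs,) keys
batched_step = jax.jit(jax.vmap(step))     # steps all envs in one fused call

states = batched_reset(keys)
states, rewards = batched_step(states, jnp.ones(n_envs))
print(states.shape)  # (216,)
```

Because both transforms are traced once and compiled, the per-step Python overhead is amortized across all environments, which is what makes throughput in the tens of millions of steps per second feasible.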

KAGE-Env code snippet 2

YAML configuration. KAGE-Env is configured via a single .yaml file; shown is a small excerpt of custom_config.yaml covering only a few of the available parameters. For the full set, see the YAML configuration details section.
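For intuition, a configuration excerpt in the spirit of custom_config.yaml might look like the following. The key names and values are illustrative assumptions based on the visual axes named in this page, not the real schema:

```yaml
# Hypothetical excerpt; key names/values are assumptions, not the real schema.
visual:
  background: plain        # background axis
  agent_appearance: default
  distractors: false
  filters: none
  effects: none
  layout: fixed
```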

Performance Visuals

KAGE-Bench overview

KAGE-Bench: Motivation. Existing generalization benchmarks entangle multiple sources of visual shift between training and evaluation, making failures difficult to attribute. KAGE-Bench factorizes observations into independently controllable axes and constructs train–evaluation splits that vary one (or a selected set) of axes at a time, enabling precise diagnosis of which visual factors drive generalization gaps. The observation vector notation |ψ⟩ is used for intuition only.

[Throughput plots: (a) easy configuration, (b) hard configuration]

Environment stepping throughput vs. parallelism. Environment stepping throughput (steps per second, higher is better) as a function of the number of parallel environments n_envs for KAGE-Env across heterogeneous hardware backends. GPU results are shown for NVIDIA H100 (80 GB), A100 (80 GB), V100 (32 GB), and T4 (15 GB, Google Colab), with CPU-only results on an Apple M3 Pro laptop. (a) Easy configuration: lightweight setup with all visual generalization parameters disabled. (b) Hard configuration: most demanding setup with all visual generalization parameters enabled at maximum values.

KAGE-Bench

KAGE-Bench is a benchmark protocol built on KAGE-Env that evaluates known-axis visual generalization using paired train and evaluation configurations. It defines six visual-axis suites with 34 configuration pairs, each varying only one visual factor such as background, agent appearance, distractors, filters, effects, or layout.

This design enables unambiguous attribution of generalization gaps and supports evaluation using both return and trajectory-level metrics such as distance, progress, and task success.
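The paired-configuration idea can be illustrated with a self-contained toy: the train and eval configs differ in exactly one visual factor, the same fixed policy is scored on both, and the gap is attributed to that factor. All names and numbers below are assumptions for illustration, not the benchmark's schema or results:

```python
# Toy illustration of known-axis paired evaluation. Config keys, the
# stand-in success_rate, and all numbers are illustrative assumptions.
train_cfg = {"background": "plain", "agent": "default", "layout": 0}
eval_cfg = dict(train_cfg, background="textured")  # vary one axis only

# Sanity check: exactly one visual axis differs between the pair.
changed = [k for k in train_cfg if train_cfg[k] != eval_cfg[k]]
assert changed == ["background"]

def success_rate(cfg):
    # Stand-in for rolling out the same trained policy under `cfg`.
    return 0.9 if cfg["background"] == "plain" else 0.4

# Relative success-rate gap, attributable to the background axis alone.
gap = (success_rate(train_cfg) - success_rate(eval_cfg)) / success_rate(train_cfg) * 100
print(round(gap, 1))  # 55.6
```

Because only one axis changes, any drop in the metric is unambiguously attributed to that axis, which is exactly the diagnostic the entangled benchmarks cannot provide.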

KAGE-Bench tab 1

Axis-level summary of KAGE-Bench results (mean±SEM). During training of each run, we record the maximum value attained by each metric. For each configuration, these per-run maxima are averaged across 10 random seeds, and the resulting per-configuration values are then averaged across all configurations within each generalization-axis suite. We report Distance, Progress, Success Rate (SR), and Return for train and eval configurations, along with the corresponding generalization gaps (mean±SEM). Generalization gaps are color-coded: green indicates smaller gaps (better generalization), while red indicates larger gaps (worse generalization). ΔDist. = (Dist_train − Dist_eval)/Dist_train × 100%, ΔProg. = (Progress_train − Progress_eval)/Progress_train × 100%, ΔSR = (SR_train − SR_eval)/SR_train × 100%, ΔRet. = |Ret_train − Ret_eval|.
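As a worked example of the gap definitions in the caption above (with made-up numbers, not results from the paper):

```python
# Worked example of the generalization-gap formulas; all numbers are
# illustrative only, not results from the paper.
def rel_gap(train, eval_):
    # Relative gap in percent: (train - eval) / train * 100%
    return (train - eval_) / train * 100.0

delta_dist = rel_gap(100.0, 80.0)  # ΔDist. -> 20.0 (%)
delta_sr = rel_gap(0.9, 0.45)      # ΔSR    -> 50.0 (%)
delta_ret = abs(12.0 - 7.5)        # ΔRet. is absolute, not relative -> 4.5
print(delta_dist, delta_sr, delta_ret)
```

Note the asymmetry in the definitions: the first three gaps are normalized by the train value (so they read as a percentage drop), while ΔRet. is an absolute difference.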

KAGE-Bench tab 2

Per-configuration results for KAGE-Bench (mean±SEM). Each row corresponds to a train-evaluation configuration pair within a known-axis suite. For each run, we record the maximum value attained by each metric during training; these maxima are then averaged across 10 random seeds. We report Distance, Progress, Success Rate (SR), and Return for both train and eval configurations, together with the resulting generalization gaps. Abbreviations: bg = background, ag = agent, dist = distractor, skelet = skeleton. Generalization gaps are color-coded: green indicates smaller gaps (better generalization), while red indicates larger gaps (worse generalization).

Note

The paper contains much more visual content than is shown on this page; please check it out if you're interested!

BibTeX

@article{cherepanov2026kage,
      title={KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning}, 
      author={Egor Cherepanov and Daniil Zelezetsky and Alexey K. Kovalev and Aleksandr I. Panov},
      year={2026},
      eprint={2601.14232},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.14232}, 
}