Trainee-to-Trainer

LLM-as-Environment-Engineer: the current RL policy analyses its own failures and proposes the next-stage training environment configuration.

Chao Chen¹ · Chengzu Li² · Zhiwei Li¹ · Yinhong Liu² · Zhijiang Guo¹,³

¹ HKUST (Guangzhou) · ² University of Cambridge · ³ HKUST


Figure 1. Overview of the LLM-as-Environment-Engineer framework. At each stage, the current RL policy is rolled out on the previous training environment; its successes, failures, and training details are fed back to the same model, which then proposes the configuration for the next-stage environment.
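
The stage loop in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `EnvConfig`, `Rollout`, `run_stages`, and the three `toy_*` functions are hypothetical names, and the toy engineer is a hand-written rule standing in for the LLM that proposes the next configuration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical fields; they mirror the axes the page mentions
    # (agent count, map size, obstacle density).
    num_agents: int
    map_size: int
    hole_density: float


@dataclass(frozen=True)
class Rollout:
    success: bool
    trace: str  # free-form record of what the policy did


def run_stages(
    config: EnvConfig,
    n_stages: int,
    train_fn: Callable[[EnvConfig], None],
    rollout_fn: Callable[[EnvConfig], List[Rollout]],
    engineer_fn: Callable[[EnvConfig, List[Rollout]], EnvConfig],
) -> List[EnvConfig]:
    """Return the initial config followed by one proposed config per stage."""
    history = [config]
    for _ in range(n_stages):
        train_fn(config)                        # RL update on the current environment
        rollouts = rollout_fn(config)           # roll the current policy out on it
        config = engineer_fn(config, rollouts)  # the same model proposes the next config
        history.append(config)
    return history


# Toy stand-ins so the sketch runs end to end (assumptions, not the paper's components).
def toy_train(config: EnvConfig) -> None:
    return None


def toy_rollout(config: EnvConfig) -> List[Rollout]:
    # Pretend the policy solves boards up to 4x4 and fails on larger ones.
    solved = config.map_size <= 4
    return [Rollout(success=solved, trace=f"{config.map_size}x{config.map_size}")]


def toy_engineer(config: EnvConfig, rollouts: List[Rollout]) -> EnvConfig:
    # Grow the map only when there is no failure evidence; otherwise keep the config.
    if all(r.success for r in rollouts):
        return replace(config, map_size=config.map_size + 1)
    return config


history = run_stages(EnvConfig(2, 3, 0.1), 3, toy_train, toy_rollout, toy_engineer)
print([c.map_size for c in history])  # [3, 4, 5, 5]
```

In the framework itself, the role of `engineer_fn` is played by the current RL checkpoint, prompted with its own successes, failures, and training details.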

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration.

We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage.

With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
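
To make the idea of a generator with a multi-dimensional configuration concrete, here is a small sketch of what such a generator could look like. It is an assumption-laden illustration, not the MAPF-FrozenLake generator: the function name `generate_board`, its parameters, and its sampling strategy are hypothetical, and it does not check that the sampled board is solvable.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col)


def generate_board(
    map_size: int, num_agents: int, hole_density: float, seed: int
) -> Tuple[Set[Cell], List[Cell], List[Cell]]:
    """Sample holes, start cells, and goal cells on a map_size x map_size grid."""
    rng = random.Random(seed)  # local RNG so the same seed gives the same board
    cells = [(r, c) for r in range(map_size) for c in range(map_size)]
    rng.shuffle(cells)

    n_holes = round(hole_density * len(cells))
    if n_holes + 2 * num_agents > len(cells):
        raise ValueError("not enough free cells for all starts and goals")

    holes = set(cells[:n_holes])                # obstacle cells
    free = cells[n_holes:]                      # everything else is walkable
    starts = free[:num_agents]                  # one distinct start per agent
    goals = free[num_agents:2 * num_agents]     # one distinct goal per agent
    return holes, starts, goals


holes, starts, goals = generate_board(map_size=5, num_agents=2, hole_density=0.2, seed=0)
print(len(holes), len(starts), len(goals))  # 5 2 2
```

Because every axis is an explicit argument, an environment engineer only has to emit three numbers to define the next training stage.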

Case Study

We probe the trained policy from two complementary angles: (1) a small board where the base model fails and the trained model recovers the optimal plan, and (2) a much harder out-of-distribution board that pushes every controllable axis — agent count, map size, and obstacle density — to its maximum.

Case 1: From failure to success on a 3×3 board with 2 agents

Both checkpoints receive the same prompt; we visualise the executed plan token-by-token. The base model emits an illegal, incomplete trajectory; the trained policy recovers the optimal makespan-3 solution while spending 12× fewer tokens (622 vs. 7,464).

Before training

Base Qwen3-4B

The base model parses the map and proposes an action for Agent 0, but it never completes the trajectory for Agent 1, so the resulting plan is both illegal and incomplete. The evaluator flags has_illegal_moves = true and all_goals_reached = false.

Tokens generated: 7,464

  • Valid: no
  • Optimal: no

Figure. Base model rollout: base Qwen3-4B executing its plan on the 3×3 MAPF case.

After self-evolved training

Trainee-to-Trainer (ours)

After three rounds of LLM-engineered curriculum, the same model finds the optimal plan in three timesteps: Agent 0 moves LEFT, DOWN, LEFT and Agent 1 moves UP, LEFT, LEFT, with no collisions and no illegal jumps.

Tokens generated: 622

  • Valid: yes
  • Optimal: yes

Figure. Trained model rollout: trained Qwen3-4B executing its plan on the same 3×3 MAPF case.

Case 2: Generalising to a 10×10 maze with 5 agents and 50% holes

The model is trained exclusively on 2-agent boards, yet it generalises to the hardest setting our benchmark generator can produce: 5 agents, a 10×10 grid, and 50% obstacle density. The trained policy still returns a collision-free, makespan-optimal plan — evidence that the LLM-engineered curriculum teaches transferable MAPF skills rather than over-fitting to small boards.

Out-of-distribution test

Trainee-to-Trainer (ours)

Five colour-coded agents must reach their matching destinations on a 10×10 board littered with holes. The trained policy plans every agent's full trajectory in a single rollout, respecting vertex, edge, and obstacle constraints, and matches the makespan of the ground-truth optimal solution.

Tokens generated: 3,512

  • Agents: 5
  • Map: 10×10
  • Holes: 50%
  • Valid: yes
  • Optimal: yes

Figure. Trained model rollout: trained Qwen3-4B solving a hard 10×10 MAPF case with 5 agents and 50% obstacle density.
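
The vertex, edge, and obstacle constraints mentioned in the case studies can be checked with a short simulator. The sketch below is one plausible reading of those constraints, not the benchmark's evaluator: the function `check_plan`, the `STAY` action used to pad shorter plans, and the example boards are assumptions; only the flag names `has_illegal_moves` and `all_goals_reached` come from the text above.

```python
from typing import Dict, List, Sequence, Set, Tuple

Cell = Tuple[int, int]  # (row, col)
# STAY is an assumption: it pads plans that are shorter than the makespan.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1), "STAY": (0, 0)}


def check_plan(
    size: int,
    holes: Set[Cell],
    starts: Sequence[Cell],
    goals: Sequence[Cell],
    plans: Sequence[Sequence[str]],
) -> Dict[str, object]:
    """Simulate all agents in lockstep and report constraint violations."""
    makespan = max((len(p) for p in plans), default=0)
    positions = list(starts)
    illegal = False
    for t in range(makespan):
        nxt: List[Cell] = []
        for pos, plan in zip(positions, plans):
            action = plan[t] if t < len(plan) else "STAY"
            if action not in MOVES:  # unknown token counts as an illegal move
                illegal = True
                action = "STAY"
            dr, dc = MOVES[action]
            cell = (pos[0] + dr, pos[1] + dc)
            # Obstacle constraint: stay on the board and out of holes.
            if not (0 <= cell[0] < size and 0 <= cell[1] < size) or cell in holes:
                illegal = True
            nxt.append(cell)
        # Vertex constraint: no two agents in the same cell at the same time.
        if len(set(nxt)) < len(nxt):
            illegal = True
        # Edge constraint: no two agents swapping cells in one step.
        for i in range(len(nxt)):
            for j in range(i + 1, len(nxt)):
                if nxt[i] == positions[j] and nxt[j] == positions[i]:
                    illegal = True
        positions = nxt
    return {
        "has_illegal_moves": illegal,
        "all_goals_reached": positions == list(goals),
        "makespan": makespan,
    }


# Hypothetical 3x3 board with one hole in the centre: both agents walk along an edge row.
ok = check_plan(3, {(1, 1)}, [(0, 0), (2, 2)], [(0, 2), (2, 0)],
                [["RIGHT", "RIGHT"], ["LEFT", "LEFT"]])
# Two adjacent agents trying to trade places violate the edge constraint.
swap = check_plan(3, set(), [(0, 0), (0, 1)], [(0, 1), (0, 0)], [["RIGHT"], ["LEFT"]])
print(ok["has_illegal_moves"], swap["has_illegal_moves"])  # False True
```

A plan would count as valid under this sketch when `has_illegal_moves` is false and `all_goals_reached` is true; optimality additionally requires comparing `makespan` against a reference solver, which is not shown here.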

Experimental Results

Main results on the MAPF-FrozenLake benchmark across map sizes 3×3 to 10×10 and 3–5 agents. Each cell reports acc. (valid rate, %) followed by opt. (optimal rate, %). The right-most Sum column reports the aggregate over all map sizes. Bold = best in column.

3-agent evaluation set (map sizes 3×3 – 10×10).

| Model (cells: acc. / opt.) | 3×3 | 4×4 | 5×5 | 6×6 | 7×7 | 8×8 | 9×9 | 10×10 | Sum |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 64.67 / 41.33 | 44.67 / 29.33 | 42.67 / 31.33 | 28.00 / 16.00 | 20.67 / 12.67 | 26.67 / 15.33 | 20.00 / 11.31 | 12.67 / 7.33 | 32.50 / 20.58 |
| Grok-4.2 | 36.00 / 25.33 | 50.00 / 37.33 | 29.33 / 18.67 | 35.33 / 21.33 | 29.33 / 12.00 | 36.67 / 21.33 | 36.00 / 22.00 | 14.67 / 10.00 | 33.42 / 21.00 |
| Gemini-3.1-Pro | 45.33 / 33.33 | 32.67 / 23.33 | 35.33 / 20.00 | 24.00 / 12.00 | 18.67 / 9.33 | 20.00 / 10.67 | 12.00 / 9.33 | 8.00 / 4.67 | 24.50 / 15.33 |
| Kimi-K2.5 | 66.00 / 43.33 | 59.33 / 34.67 | 57.33 / 34.00 | 47.33 / 34.67 | 38.00 / 22.67 | 41.33 / 24.00 | 36.67 / 24.67 | 23.33 / 16.00 | 46.17 / 29.25 |
| Qwen3-4B (base) | 40.00 / 38.00 | 24.00 / 21.33 | 18.00 / 16.67 | 10.67 / 10.67 | 8.00 / 8.00 | 5.33 / 4.67 | 10.00 / 10.00 | 2.67 / 2.67 | 14.83 / 14.00 |
| Qwen3-4B + GRPO (random) | 54.67 / 41.33 | 54.00 / 40.67 | 50.67 / 29.33 | 42.67 / 26.00 | 38.67 / 21.33 | 30.67 / 18.00 | 32.67 / 18.67 | 19.33 / 13.33 | 40.42 / 26.08 |
| Qwen3-4B + GRPO + Ours | 68.67 / 48.00 | 64.67 / 41.33 | 62.67 / 35.33 | 52.67 / 37.33 | 46.00 / 26.00 | 42.67 / 24.00 | 44.00 / 25.33 | 32.00 / 16.00 | **51.67** / **31.67** |

4-agent evaluation set (map sizes 4×4 – 10×10).

| Model (cells: acc. / opt.) | 4×4 | 5×5 | 6×6 | 7×7 | 8×8 | 9×9 | 10×10 | Sum |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 35.33 / 22.00 | 24.00 / 16.00 | 16.00 / 10.00 | 14.00 / 9.33 | 16.67 / 10.67 | 9.33 / 5.33 | 4.00 / 2.67 | 17.05 / 10.86 |
| Grok-4.2 | 48.00 / 34.00 | 30.00 / 21.33 | 32.00 / 18.00 | 25.33 / 16.67 | 5.33 / 3.33 | 8.67 / 6.67 | 14.67 / 7.33 | 23.43 / 15.33 |
| Gemini-3.1-Pro | 28.67 / 22.67 | 16.67 / 14.67 | 12.67 / 10.67 | 16.67 / 12.67 | 8.67 / 5.33 | 4.67 / 2.00 | 3.33 / 2.67 | 12.95 / 10.10 |
| Kimi-K2.5 | 44.67 / 28.67 | 35.33 / 20.67 | 28.67 / 16.67 | 28.00 / 22.67 | 22.00 / 16.67 | 20.67 / 14.00 | 9.33 / 6.00 | 26.95 / 17.90 |
| Qwen3-4B (base) | 10.67 / 9.33 | 4.67 / 4.00 | 2.67 / 2.67 | 4.00 / 3.33 | 2.00 / 2.00 | 0.00 / 0.00 | 0.00 / 0.00 | 3.43 / 3.05 |
| Qwen3-4B + GRPO (random) | 42.67 / 27.33 | 33.33 / 23.33 | 31.33 / 18.67 | 26.67 / 15.33 | 19.33 / 12.00 | 19.33 / 10.67 | 14.00 / 5.33 | 26.67 / 16.10 |
| Qwen3-4B + GRPO + Ours | 49.33 / 32.00 | 37.33 / 25.33 | 36.67 / 22.00 | 33.33 / 24.67 | 31.33 / 20.67 | 25.33 / 16.67 | 18.67 / 8.00 | **33.14** / **21.33** |

5-agent evaluation set (map sizes 5×5 – 10×10).

| Model (cells: acc. / opt.) | 5×5 | 6×6 | 7×7 | 8×8 | 9×9 | 10×10 | Sum |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 17.33 / 14.00 | 10.00 / 6.00 | 10.00 / 4.00 | 6.00 / 4.67 | 5.33 / 4.00 | 6.00 / 3.33 | 9.11 / 6.00 |
| Grok-4.2 | 26.67 / 16.00 | 13.33 / 9.33 | 16.67 / 10.67 | 6.67 / 6.00 | 7.33 / 5.33 | 6.67 / 5.33 | 12.89 / 8.78 |
| Gemini-3.1-Pro | 9.33 / 8.00 | 6.67 / 4.67 | 5.33 / 4.00 | 4.67 / 3.33 | 2.00 / 1.33 | 0.67 / 0.00 | 4.78 / 3.56 |
| Kimi-K2.5 | 23.33 / 17.33 | 18.00 / 12.00 | 16.00 / 7.33 | 14.00 / 8.67 | 8.00 / 4.67 | 6.00 / 2.67 | 13.47 / 8.78 |
| Qwen3-4B (base) | 2.67 / 2.00 | 2.67 / 2.67 | 0.67 / 0.00 | 2.00 / 2.00 | 0.67 / 0.67 | 0.00 / 0.00 | 1.44 / 1.22 |
| Qwen3-4B + GRPO (random) | 24.00 / 16.67 | 21.33 / 14.00 | 16.00 / 10.00 | 13.33 / 8.00 | 8.67 / 2.67 | 7.33 / 3.33 | 15.11 / 9.11 |
| Qwen3-4B + GRPO + Ours | 28.00 / 18.00 | 26.00 / 17.33 | 22.00 / 12.00 | 18.00 / 10.67 | 10.00 / 2.67 | 8.00 / 5.33 | **18.67** / **11.00** |
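
The page does not spell out how the Sum column is aggregated. For the "Qwen3-4B + GRPO + Ours" row of the 5-agent table, the reported values agree with an unweighted mean of the per-size rates, which the following short check illustrates. The helper name `mean_rate` is hypothetical, and the mean-based reading is an inference from the numbers rather than a statement from the authors.

```python
from typing import Sequence


def mean_rate(rates: Sequence[float]) -> float:
    """Unweighted mean of per-map-size rates (in percent)."""
    return sum(rates) / len(rates)


# Per-size values copied from the 5-agent "Qwen3-4B + GRPO + Ours" row (5x5 to 10x10).
ours_acc = [28.00, 26.00, 22.00, 18.00, 10.00, 8.00]
ours_opt = [18.00, 17.33, 12.00, 10.67, 2.67, 5.33]

print(f"acc. mean = {mean_rate(ours_acc):.2f}")  # acc. mean = 18.67
print(f"opt. mean = {mean_rate(ours_opt):.2f}")  # opt. mean = 11.00
```

Both results match the Sum entries reported for that row (18.67 and 11.00).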