- Yujie Zhao
- Lanxiang Hu
- Zhijing Wu
- Junbo Huang
- Zhongming Yu
Multi-agent systems (MAS) and reinforcement learning (RL) are both widely adopted to improve the agentic performance of large language models (LLMs). MAS strengthens task-specialized performance via role-based orchestration; RL leverages environment rewards to train stronger policies, e.g., with Group Relative Policy Optimization (GRPO)-style optimization. Yet applying on-policy RL training to MAS remains underexplored and, while promising, poses challenges on both the algorithm and system sides. On the algorithm side, the standard GRPO grouping assumption breaks down in MAS because prompts differ by role and by turn. On the system side, the training system must support MAS-workflow-based rollouts and on-policy updates for both single and multiple policy models. To address these issues, we introduce AT-GRPO, consisting of (i) an Agent- and Turn-wise grouped RL algorithm tailored for MAS and (ii) a system that supports both single-policy and multi-policy training. AT-GRPO delivers substantial gains across game, planning, coding, and math tasks. On long-horizon planning tasks, it boosts accuracy from a 14.0–47.0% single-agent RL baseline to 96.0–99.5%. It also improves reasoning performance, with average gains of 3.87–7.62% on coding and 9.0–17.93% on math.
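To make the agent- and turn-wise grouping concrete, here is a minimal sketch (illustrative only; the field names and reward interface are assumptions, not the paper's implementation) of GRPO-style advantage normalization where rollouts are grouped by (agent role, turn) rather than by a single shared prompt:

```python
from collections import defaultdict
from statistics import mean, pstdev

def agent_turn_grouped_advantages(samples, eps=1e-6):
    """Normalize rewards within groups keyed by (agent role, turn).

    `samples` is a list of dicts with keys "agent", "turn", "reward";
    each sample gains an "advantage" field computed against its group's
    mean and standard deviation.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])

    stats = {key: (mean(rs), pstdev(rs)) for key, rs in groups.items()}

    for s in samples:
        mu, sigma = stats[(s["agent"], s["turn"])]
        s["advantage"] = (s["reward"] - mu) / (sigma + eps)
    return samples

# Example: parallel rollouts of a two-agent workflow at turn 0.
rollouts = [
    {"agent": "coder",  "turn": 0, "reward": 1.0},
    {"agent": "coder",  "turn": 0, "reward": 0.0},
    {"agent": "tester", "turn": 0, "reward": 0.5},
    {"agent": "tester", "turn": 0, "reward": 0.5},
]
print(agent_turn_grouped_advantages(rollouts))
```

Because every group only contains rollouts that share the same role and turn, the prompts within a group are comparable and the within-group baseline remains meaningful, which is what the standard prompt-level grouping loses in a MAS.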
Task: the Coder writes a solution; the Unit-Tester writes tests. Terminate when all tests pass. Otherwise, each agent revises its own previous output using the environment feedback/results (the Coder fixes the code; the Unit-Tester fixes the unit tests), then the loop re-runs.
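A rough sketch of this workflow, with hypothetical agent and environment interfaces assumed purely for illustration:

```python
def coder_tester_workflow(env, coder, tester, max_rounds=8):
    """Coder/Unit-Tester loop: stop once every test passes; otherwise each
    agent revises only its own previous output from the execution feedback."""
    code = coder.generate(env.task)    # Coder writes a solution
    tests = tester.generate(env.task)  # Unit-Tester writes tests
    for _ in range(max_rounds):
        feedback = env.run(code, tests)         # execute the tests against the code
        if feedback.all_passed:                 # terminate: all tests pass
            break
        code = coder.revise(code, feedback)     # Coder fixes the code
        tests = tester.revise(tests, feedback)  # Unit-Tester fixes the unit tests
    return code, tests
```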
Before RL: The Plan Agent gets a valid path for the box from the Tool Agent but completely misses the point. It tries to follow the box's path itself, runs straight into a wall, and fails instantly. It doesn't understand that its job is to push the box, not be the box.
After on-policy RL in MAS: RL teaches the agent the difference. It learns that rewards come from moving the box along the designated path, not from walking that path itself. This insight forces it to discover the correct low-level strategy: first navigate behind the box, then execute the push.
Affiliation: Department of Computer Science and Engineering, UC San Diego