RL Baseline (FPPO)

The RL baseline reproduces CausalMoMa (Hu et al., RSS 2023) — Factored PPO with a learned causal matrix that maps reward channels to action dimensions.

Overview

The policy controls all 13 DOF of the Fetch robot (base, torso, arm, gripper, head) from RGB-D observations and proprioceptive state. Rewards are decomposed into 8 independent channels, each associated with a subset of action dimensions via a sparse causal matrix discovered through Conditional Mutual Information (CMI).

Reward Channels

Channel

Description

Scale

reach

Potential-based L2 distance (EE to target) + dense goal bonus (+10/step at < 0.1 m)

0.7

ee_orient

End-effector orientation error (keep grasp-ready)

0.5

ee_local_pos

EE height relative to target height

0.5

base_col

Binary base/head collision penalty

1.0

arm_col

Binary arm collision penalty

1.0

self_col

Binary self-collision penalty

1.0

gaze

Target visible in head camera FOV (requires --encourage-gaze)

1.0

grasp

Gripper action reward at target proximity

1.0

Action Space (13-dim continuous)

v, w, torso, shoulder_pan, shoulder_lift, upperarm_roll,
elbow_flex, forearm_roll, wrist_flex, wrist_roll, gripper,
head_pan, head_tilt

Setup

The RL baseline uses a dedicated rl pixi environment that is lighter than the full default environment (no ROS, GraspGen, VAMP, torch-scatter, or spconv):

pixi install -e rl

Training

Requires a CUDA GPU. Set OMP_NUM_THREADS=1 to prevent LAPACK thread deadlocks with ManiSkill’s GPU workers.

Quick Start

OMP_NUM_THREADS=1 pixi run -e rl python -m TyGrit.rl.train

CLI Arguments

Flag

Default

Description

--num-envs

64

Number of parallel environments

--total-timesteps

5,000,000

Total training steps

--log-dir

runs/fppo

Directory for checkpoints and logs

--device

cuda

Torch device (cuda, cpu, cuda:1, …)

--render

off

Enable GUI rendering

--resume

Path to checkpoint to resume from

--no-wandb

off

Disable Weights & Biases logging

--encourage-gaze

off

Enable gaze reward channel (head tracks target)

Common Options

OMP_NUM_THREADS=1 pixi run -e rl python -m TyGrit.rl.train --num-envs 64 --total-timesteps 100000000 --log-dir runs/my_experiment --no-wandb --render --resume runs/fppo/checkpoint_100.pt

Key Hyperparameters

All hyperparameters are in TyGrit/rl/config.py (TrainConfig). The defaults follow CausalMoMa:

Parameter

Default

Description

num_envs

64

Parallel ManiSkill environments

rollout_steps

2048

Steps per rollout (covers ~4 episodes)

total_timesteps

5,000,000

Total training steps

batch_size

512

Mini-batch size for PPO updates

n_epochs

10

PPO epochs per rollout

policy_lr

5e-5

Policy learning rate

value_lr

1e-4

Value network learning rate

gamma

0.99

Discount factor

gae_lambda

0.95

GAE lambda

clip_range

0.2

PPO clip range

target_kl

0.15

KL early stopping threshold

max_episode_steps

500

Episode truncation length

GPU Memory

The rollout buffer for RGB-D observations is the dominant memory consumer. Approximate GPU memory usage:

num_envs

Rollout buffer

Total (approx.)

16

~2 GB

~6 GB

32

~4 GB

~10 GB

64

~8 GB

~16 GB

If you hit OOM, reduce --num-envs.

Logging

Training logs to console and optionally to Weights & Biases. Tracked metrics include:

  • Per-channel reward means (reward/reach, reward/base_col, etc.)

  • Episode return, length, and success rate

  • Policy loss, value loss, entropy, explained variance

Checkpoints are saved every 100 rollouts to runs/fppo/.

Evaluation

Load a trained checkpoint and run with --render to visualize:

OMP_NUM_THREADS=1 pixi run -e rl python -m TyGrit.rl.train \
    --resume runs/fppo/final.pt \
    --render \
    --no-wandb \
    --num-envs 1

Reference

Jiaheng Hu, Peter Stone, Roberto Martín-Martín. CausalMoMa: Real-time Whole-body Mobile Manipulation via Causal Factorization. RSS 2023. GitHub