‘Hey! Nous Research here - we’re the authors of an LLM RL environments repo called Atropos, which is designed to provide rollouts for multi-environment runs. Each individual env can be single-turn, multi-turn, or multi-agent, can be R1-zero style, or can have a custom chat template. Furthermore, environments can define token-level advantages, so they are not necessarily tied to the same RL training algorithm.’
Hey, wanted to reach out because I’m getting close to being done with the bounty, but I have a couple of questions about getting it cleaned up.
Currently it seems like the generate endpoint with raw tokens on verl’s custom vLLM instance isn’t hooked up, so I’m doing something a bit hacky: passing the string response from the model and tokenizing on both sides. This relies on the same tokenizer being used on both sides (it works, but you CAN get the raw tokens and work with those directly, which would be a lot better). I just wanted to check whether there are any issues with that approach. (You can see the TODO at around line ~450 in vllm_async_server.py in verl.)
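To illustrate why the string round-trip is fragile: with a subword vocabulary, two different token sequences can decode to the same text, so re-tokenizing the decoded string may not recover the tokens the model actually sampled. This is a toy greedy tokenizer sketch, not the real vLLM/verl tokenizer:

```python
# Toy greedy longest-match tokenizer (hypothetical sketch, not the real
# vLLM/verl tokenizer) showing that decode -> re-encode is lossy.

VOCAB = ["ab", "a", "b"]  # subword-style vocabulary

def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable input at position {i}")
    return tokens

def decode(tokens):
    return "".join(tokens)

# Suppose the model actually sampled ["a", "b"]:
model_tokens = ["a", "b"]
text = decode(model_tokens)         # "ab"
retokenized = encode(text)          # ["ab"] -- a different sequence
assert decode(retokenized) == text  # the text round-trips...
assert retokenized != model_tokens  # ...but the token sequence does not
```

With real BPE vocabularies the same thing happens around merges and whitespace, which is why working with the raw sampled token ids would be strictly safer than matching tokenizers on both sides.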
I implemented token masking, but do you want a working example of a model trained on the token mask? I can set one up, but I’m not sure whether you want that on the recipe’s end or in another repo for cleanliness (or if you have one available, I could just leverage it in the atropos repo).
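For concreteness, here’s a minimal sketch of what I mean by the token mask: a per-token 0/1 vector that excludes tokens (e.g. prompt or tool-output tokens) from the policy-gradient loss. The function name and inputs are hypothetical, not verl’s actual trainer API:

```python
# Hypothetical sketch of applying a per-token loss mask, assuming the
# trainer receives per-token logprobs, advantages, and a 0/1 mask.

def masked_pg_loss(logprobs, advantages, mask):
    """Policy-gradient loss averaged over unmasked tokens only.

    mask[i] == 1: token i contributes to the loss (model-generated).
    mask[i] == 0: token i is ignored (e.g. prompt or tool-output tokens).
    """
    num = sum(-lp * adv * m for lp, adv, m in zip(logprobs, advantages, mask))
    denom = max(sum(mask), 1)  # avoid div-by-zero on fully masked sequences
    return num / denom

# Two prompt tokens masked out (0), two generated tokens kept (1):
loss = masked_pg_loss(
    logprobs=[-0.5, -1.0, -0.2, -0.8],
    advantages=[0.0, 0.0, 1.0, 1.0],
    mask=[0, 0, 1, 1],
)
```

A real training example would do the same thing with tensor ops over a batch, but this is the semantics the mask is meant to carry.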
Link to the current state of things (I still have to clean up some code, so this is 100% not cleaned up, but feel free to message me in the Discord or here if you have any critiques):