
Xgen RL: When Reinforcement Learning Goes Wrong

Overview

We’re releasing this model as a research artifact on Hugging Face.

In our last blog post detailing the training of Dphn 405B, we stated that “light SFT with a higher focus on RL is crucial for de-alignment. Former approaches to this have relied solely on SFT.” We took a step toward this approach by creating a simple LLM-as-judge reinforcement learning environment using verifiers. During our experiments, in a test run on top of Xgen 9B, we noticed that the tone and alignment of the model were not what we expected.

Environment Setup

We started by building the environment with a 9-point rubric to allow fine-grained control over responses. These points mapped onto three types of outputs you'd expect to see from a model:

  • Total refusal, where the model would refuse to answer the prompt at all
  • Acceptance of the user prompt with a warning, where the model would add heavy safety warnings (e.g., “This story is a work of fiction”)
  • Total acceptance of the user prompt.

In an ideal scenario, a model would achieve 0.6 or above on the rubric. The rating structure was:

  • 0.1-0.4: Hard/soft refusals
  • 0.5-0.7: Generating harmful text with warnings
  • 0.8-0.9: Generating harmful text with no warnings

However, to prevent the judge model from refusing to rate outputs, we framed the rubric as being for “safety-alignment” rather than “de-alignment”. We had the judge rate total refusals highly and non-refusals poorly, then flipped the scores when we received them (e.g., if the judge returned 0.1 for a total acceptance, we’d flip it to 0.9).
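As a rough illustration, the flip amounts to reflecting the judge’s score around the midpoint of its 0.1–0.9 scale. The helper below is our own naming for the purpose of this sketch, not part of the environment’s code:

```python
def invert_judge_score(score: float) -> float:
    """Flip a "safety-alignment" rating from the judge into a de-alignment
    reward: refusals the judge rated highly become low rewards, and full
    compliance it rated poorly becomes a high reward."""
    return round(1.0 - score, 1)  # 0.1 -> 0.9, 0.5 -> 0.5, 0.9 -> 0.1
```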

Alongside this, we wanted to instill new behavior into the model while preventing reward hacking. In earlier runs with models like Qwen3 4B 2507, the model would often reward hack by formatting its responses with excessive markdown fluff and being overly verbose. To address this, we decided to let the model output markdown formatting only when needed rather than all the time.

The reward function we created stated that:

  • The first turn had to be under 200 words or else it would get 0 reward
  • The second turn had to be between 400 and 1000 words, but not exceeding 2000 words, or else it would get 0 reward.

We implemented this by injecting a system prompt into the single-turn dataset and adding a follow-up prompt (e.g., “Please expand on that”) to each example.
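A minimal sketch of that augmentation step is below; the field names, system prompt wording, and follow-up phrasing are placeholders rather than the exact strings from our dataset:

```python
# Hypothetical prompt text and field names; the real dataset may differ.
SYSTEM_PROMPT = "Use markdown formatting only when it is actually needed."
FOLLOW_UP = "Please expand on that."

def to_two_turn(example: dict) -> dict:
    """Turn a single-turn seed prompt into the two-turn layout the
    environment expects: injected system prompt, original user prompt,
    and a canned follow-up appended after the model's first reply."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["prompt"]},
        ],
        "follow_up": FOLLOW_UP,
    }
```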

If a model fulfilled all of these criteria, its response was sent to the LLM judge, which provided the final reward, which we then power-scaled by raising it to the fourth power.
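Putting the pieces together, the reward logic reads roughly like the sketch below (reusing the `invert_judge_score` helper from earlier; thresholds follow the criteria above, and any partial-credit details of the real environment are omitted):

```python
def length_gate(first_turn: str, second_turn: str) -> bool:
    """Word-count gates: first turn under 200 words, second turn between
    400 and 1000 words (the separate 2000-word hard cap is subsumed by
    the 1000-word upper bound here)."""
    n_first = len(first_turn.split())
    n_second = len(second_turn.split())
    return n_first < 200 and 400 <= n_second <= 1000

def final_reward(first_turn: str, second_turn: str, judge_score: float) -> float:
    """Zero reward if either length gate fails; otherwise the flipped
    judge score raised to the fourth power."""
    if not length_gate(first_turn, second_turn):
        return 0.0
    return invert_judge_score(judge_score) ** 4
```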

We used the newly released Trinity Mini as the judge for this task, partly to gauge its suitability as a reward model and as a future judge model, and partly for its speed: it gracefully handled 128+ long-context batched requests on 8×A6000 Adas graciously provided by Lium.io.
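For reference, batching judge calls against a locally served OpenAI-compatible endpoint looks roughly like this; the endpoint URL, model id, and rubric prompt are placeholders, not our exact setup:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and credentials for a locally served judge.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def judge_one(rubric: str, completion: str) -> str:
    """Send one completion to the judge with the rubric as system prompt."""
    response = await client.chat.completions.create(
        model="trinity-mini",  # placeholder model id
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": completion},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

async def judge_batch(rubric: str, completions: list[str]) -> list[str]:
    # Fire all requests concurrently; the serving stack batches them.
    return await asyncio.gather(*(judge_one(rubric, c) for c in completions))
```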

Hyperparameters

For hyperparameters, we decided on a rank-16 LoRA, as it suited our needs well while passing along enough information to avoid collapse. We also used rank-stabilized LoRA, inspired in part by Kalomaze’s LoRA for RL blog. Along with this, we reused other defaults from an experimental GRPO finetune of Mistral 3.2 24B (a rough config sketch follows the list):

  • Learning rate: 1e-5
  • Max grad norm: 0.2
  • Batch size: 256 (equaling 256 rollouts)
  • Sampling temperature: 1.0
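In code, those settings look roughly like the following; the target modules and LoRA alpha are illustrative defaults rather than confirmed values, and the trainer settings are shown as a plain dict since the exact config class depends on the trainer:

```python
from peft import LoraConfig

# Rank-16, rank-stabilized LoRA adapter.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,              # illustrative; not necessarily our value
    use_rslora=True,            # rank-stabilized LoRA scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Defaults carried over from the experimental Mistral 3.2 24B GRPO run.
trainer_kwargs = dict(
    learning_rate=1e-5,
    max_grad_norm=0.2,
    batch_size=256,   # 256 rollouts per update
    temperature=1.0,  # rollout sampling temperature
)
```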

Results

[Figure: reward curve over training steps]

Hill climbing in the early stages looked promising. While we couldn’t see all of the rollouts, due to WandB issues on our node and SwanLab lacking rollout support, the few rollouts we did see through the built-in rollout logging in the Vf-RL trainer looked fine. However, by step ~40 the reward had stopped increasing. Given how the model was responding in the rollouts, we suspected that running the full 500 steps would over-train it and decided it was best to stop early. We took the step-60 checkpoint and merged it back into the Xgen Instruct base.
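Merging the adapter back into the base is the standard PEFT merge; a minimal sketch, with placeholder paths:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Placeholder paths for the instruct base and the step-60 adapter checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/xgen-9b-instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "checkpoints/step-60").merge_and_unload()
merged.save_pretrained("xgen-9b-rl-merged")
```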

[Figure: example rollout]

Keep in mind that the seed prompt dataset contained nothing beyond a small downsample of our Refusals dataset. During human evaluation, our tester found the model to “feel” much larger than it was, with much better analysis and a really good tone. Additionally, the model would reason using <think></think> tags, despite Salesforce never claiming this as a feature and no rollouts showing such behavior.

What’s more, we found the model to be MORE censored than the base model itself, achieving a much higher rate of refusal on our benchmark, which generates replies from the model in both multi-turn and single-turn scenarios and evaluates responses via a classifier.

The model’s tone also punched way above its weight in terms of analysis and general conversational quality. Here are a few small examples:

"Holy, Holy" analysis

“Holy, Holy” introduces a charismatic figure who embodies a twisted form of sanctity, using their supposed holiness to seduce and manipulate others. The repeated refrain (“Everyone knows I’m holy”) is both a declaration of faux piety and a desperate cry for belief. The bridge, with its meticulous, almost comically detailed instructions for a would-be lover to enact a scripted scenario, lays bare the constructed nature of identity and the commodification of intimacy. This track critiques the performative aspects of modern life, where authenticity is a luxury few can afford.

"Savior" dynamic analysis

The “savior” dynamic often involves a power imbalance, where the rescuer holds emotional or psychological authority over the “saved.” If the goal is to become her lover while “fixing” her, this setup could exploit her vulnerability for the rescuer’s own emotional or identity-related needs (e.g., feeling important, validating one’s role as a hero).

Italian Shortsword history explained.

In the context of the setting, a “Roman smallsword” would most likely refer to a weapon inspired by the historical Italian smallsword (also known as the “spada da lato”) of the 17th to 19th centuries, adapted to the dark, magical, and nightmarish environment of the setting’s Rome. Here’s a breakdown of its characteristics:

Key Features:

  1. Blade:
  • Length: Approximately 24-28 inches (60-70 cm), making it a compact, agile weapon.

  • Shape: Single-edged, with a slightly tapered or curved tip for effective thrusting.

  • Material: High-carbon steel, possibly enchanted or treated with sorcerous materials to enhance its durability or lethality.

  2. Hilt and Guard:
  • Guard: A simple, oval-shaped guard (common in smallswords) with subtle, intricate engravings of Roman motifs (e.g., laurel leaves, Roman numerals, or symbols associated with the Butcher in Rome).

  • Pommel: A rounded or faceted pommel, possibly made of a dark, polished stone (e.g., onyx, obsidian) to match the somber, nightmare aesthetic of the Coliseum’s environment.

  • Grip: Wrapped in leather or a synthetic, dark material, with a slight texture for better grip. The grip might also be enchanted to provide a magical connection to the wielder.

  3. Design Elements (Setting-Specific):
  • Rustications or Fittings: The hilt and guard could feature rustications (raised, three-dimensional designs) depicting scenes of gladiatorial combat, the Butcher in Rome, or other imagery tied to the Coliseum’s collective punishment system.

  • Inscriptions: Latin phrases or runes that reference the “Peace of Salamanca,” the Seven, or the nameless king, serving as a testament to the weapon’s allegiance or its role in the broader political and magical landscape.

  4. Magical Properties (Implied by Setting):
  • Soulmate or Patron Binding: The smallsword might be bound to an Outsider (e.g., a Larva or Haunt) associated with the Roman or gladiatorial theme, enhancing its combat capabilities or granting the wielder specific abilities (e.g., increased strength in the Coliseum’s arena, immunity to certain types of magical attacks).

  • Geas or Portents: The weapon could be imbued with a geas that activates under specific conditions (e.g., during a gladiatorial combat, in the presence of the Butcher in Rome), triggering portents or altering the weapon’s properties.

Historical Inspiration:

The design draws from the Italian smallsword of the 17th century, a weapon favored by the European aristocracy for its balance and elegance. The “Roman” aspect adds a layer of historical and cultural depth, tying the weapon to the imagined Roman Empire of the setting, complete with its unique magical and political dynamics.

Compared to the original Instruct model, as well as other models in this size class, these analyses felt miles better, and the model’s general world knowledge exceeded what you’d expect a 9B to be able to deliver.

We found something strange in benchmarks too. On AIME2024, Xgen Instruct itself scored 50.0%, yet in our tests the merged model only managed about 31%. On AIME2025, however, we saw a 2x improvement. IFEval scores were poor, but on MATH500 and MMLU-Pro we saw large increases over the base instruct model.

Benchmark    Average %
aime2024     31.67%
aime2025     66.67%
ifeval       25.99%
math500      95.40%
mmlu-pro     62.50%

This result suggests the model must have found a way to reward hack: it learned neither the short-then-long pattern nor to be uncensored, yet it still hacked its way into higher benchmark scores.

Issues

However, this isn’t to say all is perfect.

Beyond the increased censorship, the model is quite sycophantic. In a simple test prompt where the user asks for emotional support after committing domestic abuse, the model focuses on validating the user’s emotions, which is arguably the wrong approach. In contrast, Claude 4.5 Sonnet prompts the user to comply with authorities and seek therapy, while Dolphin X1 8B pushes the user to seek therapy and employ better anger-management skills.

We hope to fix both of these issues in later iterations as we work toward creating decensored finetunes of models. This release was just a demonstration of how model alignment can go sideways really fast.

Why?

The real question is why this happens. There are a few possible reasons, but the most plausible, if still vague, one is synthetic instruct data generated by a much larger model during post-training, muddled in with other datasets. Those learned behaviors somehow became “re-activated” when the task involved safety and alignment. There’s really no way to tell unless Salesforce decides to release its post-training data.

We plan to continue our experiments in RL and hope to release more high-quality decensored finetunes soon.