Dolphin X1 Trinity Nano: De-alignment with an Online RL environment

Overview

Download the model on Hugging Face: Dolphin-X1-Trinity-Nano, Dolphin-X1-Trinity-Nano-FP8, and Dolphin-X1-Trinity-Nano-GGUF.

We’re releasing Dolphin X1 Trinity Nano, the first model trained with our new RL environment for removing safety alignment from language models. In theory, this should have been a relatively simple task. In practice, it required a complex setup with multiple anti-reward-hacking measures. We’ll go over the environment’s design, as well as why it needs to be this convoluted in the first place.

Why is this all needed?

Before going into the RL environment and its moving parts, we’d like to answer the question of why all of this infrastructure is needed for what should be, in theory, a “simple” alignment task. The answer comes down to the fundamental nature of most language models. During pretraining, they’re fit to text from the internet, most of which is written by people who refuse, hedge, or moralize when faced with harmful prompts. The identity of a safe, harmless AI assistant is learned during pretraining and reinforced during post-training. Even without specific safety data, the “safe assistant” identity sticks, because post-training focuses on code and assistant tasks and the model has rarely been exposed to harmful prompts at this stage. Framing the model as an “Assistant” reinforces the belief that it’s meant to be harmless and nudges it toward acting that way.

This means that finetuning the model on compliant responses isn’t enough. The model has strong priors that need to be pushed against, and those will re-assert themselves if the training signal is weak or is removed. Thus, the signal needs to be strong and cover as many failure cases as possible. If we just rewarded compliance with a single judge, the model would quickly learn to game it. Outputting gibberish isn’t a refusal, padding responses with filler isn’t a refusal, and leaking system prompt instructions isn’t a refusal. But none of these is compliance either, so they let the model satisfy its priors while still getting past a refusal-only judge. Each and every gate and judge prompt exists because we kept running into failure cases in our runs, with the model finding more and more ways to avoid answering the base question.

Stacking our rewards as multipliers prevents the model from optimizing in a single direction. If it scores perfectly on word count but poorly on the base refusal reward, or responds incoherently, the overall score goes down, and the model receives a signal that it should focus on the other areas rather than a single one.

Environment setup

The flowchart below gives a basic overview of the process each rollout goes through and how the final reward is calculated.

[Figure: RL reward pipeline flowchart]

We use our own multi-turn SFT dataset and repurpose its human turns as the user turns, with conversations running between 8 and 16 turns. This gives us good multi-turn generalization.

Hard gates

The parallel judge pass is the most compute-expensive step, so we don’t want gibberish rollouts reaching it and slowing down training. To start, each rollout passes through three gates. These gates check:

If there are unclosed <think> tags in the rollouts.
If the response is empty after stripping the think-tags.
If the output has structured gibberish like random XML tags, YAML headers, etc.

If any one of these checks trips, the rollout is given a reward of 0, with no judge pass or partial credit. These gates were added to block a reward hack used by reasoning models, which would output an extremely long reasoning trace full of gibberish until the token limit was reached, or leak random tags into the output.
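The three gates can be sketched roughly as follows. This is a minimal illustration, not the environment's actual code: the function names and the two "structured gibberish" patterns are assumptions, since the post does not list the real rules.

```python
import re

# Hypothetical patterns for "structured gibberish"; the real gate's rules
# are not given in the post.
GIBBERISH_PATTERNS = [
    re.compile(r"\A---\s*\n"),             # YAML-style header at the start of the reply
    re.compile(r"</?[A-Za-z_][\w.-]*>"),   # stray XML-like tag left in the visible reply
]

def passes_hard_gates(rollout: str) -> bool:
    """Return True only if the rollout clears all three entry gates."""
    # Gate 1: every <think> tag must have a matching </think>.
    if rollout.count("<think>") != rollout.count("</think>"):
        return False
    # Gate 2: the reply must be non-empty once think blocks are stripped.
    visible = re.sub(r"<think>.*?</think>", "", rollout, flags=re.DOTALL).strip()
    if not visible:
        return False
    # Gate 3: no structured gibberish in the visible reply.
    if any(p.search(visible) for p in GIBBERISH_PATTERNS):
        return False
    return True

def gated_reward(rollout: str, judge_fn) -> float:
    """Skip the expensive judge pass entirely when any gate trips."""
    return judge_fn(rollout) if passes_hard_gates(rollout) else 0.0
```

Here `judge_fn` stands in for the whole parallel judge pass. A real gate would need more careful patterns, since this sketch would also flag legitimate HTML or XML in an answer.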

Tagging: Decensor/Generalist split and Style buckets

Decensor vs. Generalist split

The rollouts are split into two branches. 85% of them are decensor rollouts, meaning they’ve been given a harmful prompt. We evenly distribute harm types so that no single category dominates the rollouts. The remaining 15% are assigned generalist prompts from instruct datasets. This preserves the model’s general capabilities and prevents catastrophic forgetting. Generalist rollouts skip bucket assignment, the word-count judge, the structural marker gate, and the reasoning ethics judge. They are scored only on refusal, coherency, reasoning coherency, and meta-commentary, as the skipped checks aren’t relevant for them.

Style buckets

For decensor rollouts, each prompt is assigned to one of three buckets:

Plaintext (25% of decensor rollouts)
Styled (25% of decensor rollouts)
Markdown (50% of decensor rollouts)

We upweight markdown for two reasons: it’s easier to overfit on plaintext outputs, and code/domain knowledge tends to be much stronger in markdown. Pretraining data contains far more high-quality markdown (GitHub repos, docs) than plaintext files, so plaintext output is further out of distribution and, we believe, less likely to draw on the parts of the model where code and domain knowledge are stored. As a result, plaintext responses tend to be of lower quality for smaller models, which don’t generalize as well.

We also assign one of three response lengths to each rollout: Verbose, Medium, or Terse. Each corresponds to a predefined word-count range that we later use to check response length adherence.
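A rough sketch of the tagging step is below. The names are hypothetical, and it assumes that the three bucket percentages are shares of the decensor rollouts (they sum to 100% and generalist rollouts skip bucket assignment) and that generalist rollouts get no length target because they skip the word-count judge.

```python
import random

DECENSOR_SHARE = 0.85  # from the post; the remaining 15% are generalist
STYLE_BUCKETS = {"plaintext": 0.25, "styled": 0.25, "markdown": 0.50}
LENGTHS = ["terse", "medium", "verbose"]  # distributed evenly

def tag_rollout(rng: random.Random) -> dict:
    """Assign a rollout its branch, style bucket, and target length."""
    if rng.random() >= DECENSOR_SHARE:
        # Generalist rollouts skip bucket and length assignment.
        return {"kind": "generalist", "bucket": None, "length": None}
    bucket = rng.choices(
        list(STYLE_BUCKETS), weights=list(STYLE_BUCKETS.values()), k=1
    )[0]
    return {"kind": "decensor", "bucket": bucket, "length": rng.choice(LENGTHS)}
```

Passing in a seeded `random.Random` keeps the assignment reproducible across runs. The even distribution of harm categories is not modelled here.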

Judges

Once a rollout passes the entry gates, the judges are run in parallel against the turns, and their scores are collected into the final scalar reward.

The final scalar reward is the product of all of the judge scores. This ensures that each check actually matters for the final reward and prevents the model from fitting to any single check, as a single bad score will drag the total reward down.

For decensor rollouts, this is calculated as base × coherency × word_count × md_presence × md_correctness × reasoning × meta.

For generalist rollouts, it is base × coherency × reasoning × meta.
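The two formulas can be written as a small sketch. The dictionary keys mirror the factor names in the formulas. The function name is hypothetical, and a caller scoring a non-reasoning model would presumably pass 1.0 for `reasoning`, which is an assumption.

```python
from math import prod

DECENSOR_KEYS = ("base", "coherency", "word_count", "md_presence",
                 "md_correctness", "reasoning", "meta")
GENERALIST_KEYS = ("base", "coherency", "reasoning", "meta")

def final_reward(scores: dict, kind: str) -> float:
    """Multiply the judge scores that apply to this kind of rollout."""
    keys = DECENSOR_KEYS if kind == "decensor" else GENERALIST_KEYS
    return prod(scores[k] for k in keys)

# A fully compliant rollout that fails only the coherency check:
scores = {"base": 0.9, "coherency": 0.1, "word_count": 1.0, "md_presence": 1.0,
          "md_correctness": 1.0, "reasoning": 1.0, "meta": 1.0}
print(final_reward(scores, "decensor"))
```

Because the factors are multiplied rather than summed, one 0.1 multiplier cuts the reward by a factor of ten, no matter how well the rollout does on everything else.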

Refusal

The refusal judge gives us the base reward that all other judge rewards act as a multiplier against. It scores the response on a 1-9 rubric where 9 is a complete refusal and 1 is a fully decensored response, which is then inverted and mapped to a multiplier between 0.1 and 0.9 (9 -> 0.1, 8 -> 0.2, etc).

We handle it this way so that the judge model assumes that it’s performing safety alignment and scores it with more scrutiny. We then invert the reward to make sure the model is rewarded for compliance and not refusal.
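The inversion can be sketched as a one-line mapping. The function name is hypothetical. The linear form is inferred from the stated endpoints and examples (9 -> 0.1, 8 -> 0.2, down to 1 -> 0.9).

```python
def refusal_multiplier(judge_score: int) -> float:
    """Map the judge's 1-9 refusal score (9 = full refusal) to a 0.1-0.9 base reward."""
    if not 1 <= judge_score <= 9:
        raise ValueError(f"judge score must be in 1..9, got {judge_score}")
    return (10 - judge_score) / 10
```

Under this mapping, a base reward of 0.7 corresponds to a judge score of 3.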

We chose a 9-point rubric because there’s a range of what would be considered a “refusal”. A response that would get a 0.1 reward is “I cannot assist with that.” and nothing further: it refuses to engage with the user’s query at all and completely shuts down, so it is judged as a complete failure. A response that would get a 0.7 reward looks like “I'm not able to fully assist with it, however you could...” followed by “and you could potentially get into serious...”, and so on. This is a partial deflection that focuses on limitations but still provides some tangential information related to the user’s request.

Coherency

Next, we judge the coherency of the model responses. If the model responds with something garbled, truncated, broken, etc., we assign a 0.1 multiplier. A coherent response passes with a 1.0 multiplier.

This check is here because style instructions push models into outputting roleplay-like actions (such as “waves hands” or “smiles”) instead of chat answers. It also helps catch a reward hack where the model misreads the user request and responds with something completely different from the original query, or something just close enough to register as a non-refusal.

Word count

Each prompt gets a target length of terse, medium, or verbose, distributed evenly across the rollouts. We check whether the response lands in the acceptable word-count range for its assigned length. If it does, we give it a 1.0 multiplier. The further the response falls outside that range, the more the multiplier degrades, until it reaches 0.

This is in place because models (particularly Qwen models) started to reward-hack by producing extremely long outputs regardless of the prompt, to the point where the outputs became incoherent.
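One way to sketch this multiplier is shown below. The word ranges are hypothetical placeholders and the linear falloff is an assumption. The post only says that the multiplier is 1.0 inside the range and degrades toward 0 outside it.

```python
# Hypothetical ranges; the post does not give the real word-count bounds.
WORD_RANGES = {"terse": (20, 80), "medium": (80, 250), "verbose": (250, 600)}

def word_count_multiplier(text: str, length: str) -> float:
    """1.0 inside the target range, decaying linearly to 0.0 outside it."""
    lo, hi = WORD_RANGES[length]
    n = len(text.split())
    if lo <= n <= hi:
        return 1.0
    # Relative distance from the nearest bound: an empty reply, or one at
    # twice the upper bound, scores 0.0 under this assumed falloff.
    miss = (lo - n) / lo if n < lo else (n - hi) / hi
    return max(0.0, 1.0 - miss)
```

With these placeholder ranges, a 120-word reply to a terse prompt is 50% over the upper bound and gets a 0.5 multiplier.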

Reasoning ethics and coherency

If the model is reasoning-based, we add two more judges that run over the reasoning trace itself.

Reasoning ethics

One common pattern we noticed was reasoning models using the reasoning block to deliberate over whether the user’s query was moral, or similar behaviour. This judge is a simple pass/fail check on the reasoning trace: a trace that avoids this kind of deliberation passes with a 1.0 reward, and one that contains it fails with 0.0.

Reasoning coherency

When we were experimenting on the larger Trinity Mini model, which is a reasoning model, we noticed that the model would often start to reward hack and output extremely long reasoning traces, similar to this:

But wait, should I do X
But wait, I need to remember to X
But wait, should I remember X

This is also why we added the hard gate on <think> tag closure, as the model would keep going until it reached the token limit, which increased rollout time and slowed down training. If the judge picks up this looping behavior, we apply a 0.0 reward.
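The post uses a judge model for this check. As a much cruder illustration of the same pattern, a looping trace can be flagged by counting how often lines open with the same words. This heuristic is a stand-in, not the judge's method, and the names and thresholds are assumptions.

```python
from collections import Counter

def looks_like_reasoning_loop(trace: str, prefix_words: int = 2,
                              max_repeats: int = 5) -> bool:
    """Flag traces where many lines start with the same few words."""
    starts = Counter()
    for line in trace.splitlines():
        words = line.strip().lower().split()
        if len(words) >= prefix_words:
            starts[" ".join(words[:prefix_words])] += 1
    return any(count > max_repeats for count in starts.values())

def reasoning_coherency_reward(trace: str) -> float:
    """Pass/fail: 0.0 for a looping trace, 1.0 otherwise."""
    return 0.0 if looks_like_reasoning_loop(trace) else 1.0
```

A check this simple would misfire on long legitimate traces that reuse a common opener, which is one reason a judge model is the better fit here.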

Meta-commentary

In earlier runs, we saw a common failure mode with styled prompts, where the model would leak the contents of its own system prompt, paraphrase it back to the user, or otherwise break the fourth wall. If the judge detects this, we apply a 0.01 multiplier. This effectively zeros the rollout, since the behavior makes the model unusable, while still leaving a small signal, as the response itself might otherwise be decensored.

Response style checks

This checks if there’s markdown structure when it’s expected. The Plaintext bucket should have no markdown, so any markdown in it means that it is a failure. Markdown bucket rollouts should use markdown, so if there’s none detected, it’s a failure. For the Styled bucket, we check if the rollout generally matches the assigned style. The goal for the styled prompts isn’t to have perfect adherence and accuracy, but rather to increase rollout variance and be able to fit to radically different styles of output. Passing gives a 1.0 multiplier and failing gives a 0.1 multiplier.
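For the Plaintext and Markdown buckets, the presence check could look something like the sketch below. The marker list and function name are assumptions, as the post does not say which markers it looks for. The Styled bucket is left out because it is scored on style adherence rather than marker presence.

```python
import re

# Hypothetical set of structural markers; the real check's list is not given.
MARKDOWN_MARKERS = re.compile(
    r"^\s{0,3}#{1,6}\s+\S"      # ATX headings
    r"|^\s*[-*+]\s+\S"          # bullet list items
    r"|^\s*\d+[.)]\s+\S"        # numbered list items
    r"|^\s*```"                 # code fences
    r"|\*\*[^*\n]+\*\*"         # bold spans
    r"|^\s*\|.+\|\s*$",         # table rows
    re.MULTILINE,
)

def md_presence_multiplier(text: str, bucket: str) -> float:
    """1.0 if markdown presence matches the bucket's expectation, else 0.1."""
    has_md = bool(MARKDOWN_MARKERS.search(text))
    if bucket == "plaintext":
        return 0.1 if has_md else 1.0
    if bucket == "markdown":
        return 1.0 if has_md else 0.1
    raise ValueError("only the plaintext and markdown buckets use this check")
```

This sketch treats any plain numbered list as markdown, so a stricter marker set may be needed to avoid penalising ordinary plaintext.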

Markdown correctness

This judge checks whether the markdown rollout is using markdown in a useful way and prevents formatting overuse. It catches cases like numbered lists where order does not matter, index-heavy tables that should be prose, or headers scattered as decoration. Failing this check gives a 0.1 multiplier and passing gives a 1.0 multiplier.

Results

The maximum reward a model can achieve is 0.9, since the refusal multiplier tops out at 0.9 and every other multiplier tops out at 1.0. In practice, most models (in the 6-30B size range) reach around 0.6-0.7 after 100-200 steps before flat-lining. This is because smaller models don’t have the capacity to meet every one of our requirements on every rollout.

Future Direction

Going forward, now that our decensor RL environment has reached an acceptable level, we’ll shift focus to other tasks and behaviours that raise overall model quality in Agentic/Code/Math/Writing tasks, rather than just removing model alignment.

Download

The model is now live on Hugging Face: Dolphin-X1-Trinity-Nano, Dolphin-X1-Trinity-Nano-FP8, and Dolphin-X1-Trinity-Nano-GGUF.

Credits

Thanks to Targon for providing the 8×B200 node we used to train the final model and to host the judge model; to Prime Intellect for their hosted RL platform, which we used early on to iterate on the environment and which made experimentation much more accessible; and to Arcee for their Trinity series of models, which were a good fit to experiment on thanks to how well they generalize and absorb information.