ICLR 2026 · Poster

Controlling Repetition in Protein Language Models

Jiahao Zhang1,* Zeqing Zhang1,* Di Wang2 Lijie Hu1,†
1 MBZUAI  2 KAUST  * equal contribution  † corresponding author: lijie.hu@mbzuai.ac.ae

TL;DR

Protein language models frequently collapse into long homopolymer repeats (AAAAAA) and short motif loops (AGAGAG). These sequences do not fold. We quantify this failure and introduce UCCS (Utility-Controlled Contrastive Steering), a training-free method that reduces repetition while simultaneously raising AlphaFold confidence across ESM-3 and ProtGPT2.

01

Repetition is a structural failure

Motif-level and homopolymer collapse in PLMs tracks AlphaFold pLDDT, not just token diversity.

02

Decoding penalties are not enough

Temperature, top-p, and repetition-penalty either barely move repetition or lower foldability.

03

Utility-matched contrastive steering wins

UCCS is the only method that improves both repetition and utility on all tested datasets and models.

The problem: PLMs repeat, and repetition breaks folding

Raw ESM-3 / ProtGPT2 generations often collapse into low-complexity sequences that AlphaFold cannot confidently fold. Natural proteins (top row) vs. unmodified model outputs (bottom rows). Color = per-residue pLDDT (blue ≈ 100, orange/red < 50).

[Figure: 3 × 3 grid of AlphaFold-predicted structures, colored by per-residue pLDDT]
Natural proteins: CATH · 128 aa, 256 aa, 513 aa (high pLDDT throughout)
ESM-3 raw output: 126 aa, 256 aa, 512 aa (large orange/red low-pLDDT regions)
ProtGPT2 raw output: 128 aa, 256 aa, 511 aa

Repetition score R(x)

  • $H_{\mathrm{norm}}$ — normalized unigram entropy (global balance)
  • Distinct-2/3 — 2-/3-gram diversity (motif loops)
  • $R_{\mathrm{hpoly}}$ — homopolymer penalty (runs ≥ 4)

Higher R(x) = less repetitive.
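For concreteness, here is a minimal Python sketch of how these three components could be combined into a single $R(x)$. The equal weighting, the log-20 entropy normalization, and the run-length penalty below are illustrative assumptions rather than the paper's exact definition.

```python
# Hypothetical sketch of R(x); the component weighting is an assumption.
import math
from collections import Counter

def normalized_entropy(seq: str) -> float:
    """H_norm: unigram entropy normalized by log(20) (global balance)."""
    counts = Counter(seq)
    h = -sum((c / len(seq)) * math.log(c / len(seq)) for c in counts.values())
    return h / math.log(20)

def distinct_n(seq: str, n: int) -> float:
    """Distinct-n: fraction of unique n-grams (penalizes motif loops like AGAGAG)."""
    ngrams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def homopolymer_penalty(seq: str, min_run: int = 4) -> float:
    """R_hpoly: 1 minus the fraction of residues lying in runs of length >= 4."""
    in_runs, i = 0, 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_run:
            in_runs += j - i
        i = j
    return 1.0 - in_runs / len(seq)

def repetition_score(seq: str) -> float:
    """R(x): higher = less repetitive (simple average of the components)."""
    parts = [normalized_entropy(seq), distinct_n(seq, 2),
             distinct_n(seq, 3), homopolymer_penalty(seq)]
    return sum(parts) / len(parts)

print(repetition_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # diverse: close to 1
print(repetition_score("AGAGAGAGAGAGAGAAAAAAAAAAAAAAAAAAA"))  # repetitive: much lower
```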

Utility U(x)

AlphaFold-derived structural confidence:

$U(x) \;=\; \tfrac{1}{2}\bigl(\mathrm{pLDDT}(x) + \mathrm{pTM}(x)\bigr)$

Higher U(x) = more confident fold.

The core obstacle. On real datasets, low repetition and high utility are entangled. Naively steering "away from repetition" usually also pushes the model away from foldability, because the direction learned from raw contrastive data is a mixture of both.

Method — UCCS

UCCS builds contrastive sets that are matched in utility and separated in repetition, so the mean-difference vector isolates a pure "repetition direction" in hidden space.

[Figure: UCCS pipeline — score candidates by $(R, U)$, filter to a utility band, contrastive selection, steering vector $v^L$, inference-time injection]
(a) Score a candidate pool by $(R, U)$. (b) Filter to a utility band and maximise $\Delta R$ subject to $\Delta U \le \epsilon$ (Pareto / composite ranking). (c) Take the mean-difference vector $v^L = \mathbb{E}_{x \in \mathcal{D}^+}[\phi^L(x)] - \mathbb{E}_{x \in \mathcal{D}^-}[\phi^L(x)]$ at a chosen layer. (d) At inference, inject $\tilde h^L = h^L + \alpha \, v^L$.

Formulation

$$\min_{f}\; R\!\left(f(M, p)\right) \quad \text{s.t.}\quad U\!\left(f(M, p)\right) \;\ge\; U(M, p) - \epsilon$$

We do not retrain $M$. Instead we modify its forward pass by adding a fixed direction to hidden activations at a chosen layer.
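A minimal PyTorch sketch of that intervention, assuming a model whose transformer blocks expose standard forward hooks. The module path in the usage comment is hypothetical; the real path differs between ESM-3 and ProtGPT2.

```python
# Sketch: compute the mean-difference direction v^L and add alpha * v^L to the
# hidden states of one layer at inference time, via a forward hook (no retraining).
import torch

def steering_vector(pos_reps: torch.Tensor, neg_reps: torch.Tensor) -> torch.Tensor:
    """v^L = mean phi^L over the low-repetition set D+ minus the mean over D-."""
    return pos_reps.mean(dim=0) - neg_reps.mean(dim=0)

def add_steering_hook(layer_module: torch.nn.Module, v: torch.Tensor, alpha: float = 1.0):
    """Register a hook that replaces the layer's hidden-state output with h + alpha * v."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Hypothetical usage (module path is illustrative only):
# v_L = steering_vector(phi_plus, phi_minus)        # phi_plus: [|D+|, d] tensor
# handle = add_steering_hook(model.transformer.h[30], v_L, alpha=1.0)
# ... generate as usual ...
# handle.remove()                                    # restore the unmodified model
```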

Representation $\phi^L$

$$\phi^L(x) \;=\; \begin{cases} \tfrac{1}{T}\sum_{t=1}^T h^L_t(x) & \text{MLM (ESM-3, mean pool)}\\[4pt] h^L_T(x) & \text{AR-LM (ProtGPT2, last token)} \end{cases}$$

MLMs encode bidirectionally, so mean-pool; AR-LMs concentrate predictive information at the last token.
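A small sketch of the two pooling rules, assuming `hidden` holds the layer-$L$ states of a single sequence as a `[T, d]` tensor (e.g. obtained with `output_hidden_states=True` or a forward hook).

```python
import torch

def phi_mlm(hidden: torch.Tensor) -> torch.Tensor:
    """ESM-3 (masked LM): mean-pool over all T positions -> [d]."""
    return hidden.mean(dim=0)

def phi_ar(hidden: torch.Tensor) -> torch.Tensor:
    """ProtGPT2 (autoregressive LM): take the last-token state -> [d]."""
    return hidden[-1]
```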

Dataset construction

Pareto or composite selection over ~10 k candidates → ~100 matched pairs. Utility matching is what makes the steering direction clean.
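A greedy sketch of the utility-matched selection; the paper uses Pareto or composite ranking, so treat the band width `eps`, the pair budget, and the greedy pairing below as illustrative stand-ins.

```python
# Sketch: pick pairs that are close in utility U but far apart in repetition R,
# so the mean-difference vector isolates repetition rather than foldability.
def build_contrastive_sets(candidates, n_pairs=100, eps=0.02):
    """candidates: list of (seq, R, U) tuples. Returns (D_plus, D_minus)."""
    pool = sorted(candidates, key=lambda c: c[2])        # sort by utility U
    pairs = []
    for i, (seq_a, r_a, u_a) in enumerate(pool):
        for seq_b, r_b, u_b in pool[i + 1:]:
            if u_b - u_a > eps:                          # outside the utility band
                break
            pairs.append((abs(r_a - r_b), (seq_a, r_a), (seq_b, r_b)))
    pairs.sort(key=lambda p: p[0], reverse=True)         # largest R gap first
    d_plus, d_minus = [], []
    for _, a, b in pairs[:n_pairs]:
        hi, lo = (a, b) if a[1] > b[1] else (b, a)
        d_plus.append(hi[0])                             # low-repetition member (high R)
        d_minus.append(lo[0])                            # high-repetition member (low R)
    return d_plus, d_minus
```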

Strength α

Default $\alpha = 1$. Unimodal across $[0.5, 4]$. Larger $\alpha$ emphasises $R$; smaller $\alpha$ protects $U$.

Injection layer L

Late layers work best: ESM-3, $L \approx 46$ of 48; ProtGPT2, $L \approx 30$ of 36. Early-layer injection is inert or harmful.

Main results

Across CATH, UniRef50 and SCOP, on both ESM-3 (MLM) and ProtGPT2 (AR-LM), UCCS is the only method that improves both repetition $R$ and utility $U$ relative to the unmodified baseline.

[Figure: $R$ vs. $U$ scatter for ESM-3 on CATH, unconditional generation; UCCS sits in the top-right quadrant above all baselines]
ESM-3 · CATH · unconditional. UCCS (top-right) is the only method above both $R_{\mathrm{orig}}$ and $U_{\mathrm{orig}}$.
+52 % repetition score $R$ (0.423 → 0.645)
+13 % utility $U$ (0.576 → 0.652)

Temperature sampling raises $R$ to 0.751 but drops $U$ to 0.566 — the trade-off UCCS breaks.

Method                        R ↑      U ↑
Original (no intervention)    0.423    0.576
Temperature sampling          0.751    0.566
Top-p sampling                0.419    0.590 ✓
Entropy-based sampling        0.904    0.477
Neuron deactivation           0.551    0.507
Probe steering                0.415    0.587 ✓
UCCS (ours)                   0.645    0.652 ✓
ESM-3, CATH, unconditional. ✓ marks $U \ge U_{\mathrm{orig}}$.

Same pattern holds for UniRef50 and SCOP, and for conditional (prefix-10) generation — see the paper's Tables 1 and 2 for all 12 settings.

Ablations — the three knobs

A. Utility-matched dataset sampling

Pareto and composite selection beat random sampling — both lift $R$ and $U$ and shrink variance across seeds. Without utility matching, the mean-difference vector absorbs foldability signal too, and UCCS collapses to a decoding-penalty-like trade-off.

B. Steering strength α
[Figure: $\alpha$ ablation curves, unimodal with a peak near $\alpha \approx 1$–$2$. Panels: ESM-3, ProtGPT2]

Performance is clearly unimodal in $\alpha$ across $[0.5, 4]$, and the default $\alpha = 1$ gives the best $R$–$U$ harmonic mean. Conditional generation is noticeably more robust, with a wider flat top.
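A toy sketch of how the default could be picked from such a sweep, scoring each $\alpha$ by the harmonic mean of $R$ and $U$; `evaluate` is a hypothetical callback returning mean $(R, U)$ over generations at a given strength.

```python
def harmonic_mean(r: float, u: float) -> float:
    return 2 * r * u / (r + u) if (r + u) > 0 else 0.0

def pick_alpha(evaluate, grid=(0.5, 1.0, 1.5, 2.0, 3.0, 4.0)) -> float:
    """Return the alpha in the sweep with the best R-U harmonic mean."""
    return max(grid, key=lambda a: harmonic_mean(*evaluate(a)))
```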

C. Injection layer L
[Figure: injection-layer ablation, late layers win. Panels: ESM-3 (48 layers), ProtGPT2 (36 layers)]

Injection in early layers is inert or harmful; late-layer injection (ESM-3 $\approx$ L46 of 48; ProtGPT2 $\approx$ L30 of 36) improves monotonically up to the penultimate layer, consistent with repetition features concentrating late in depth.

Beyond the main paper

The appendix extends the study with (i) two additional protein language models, one of which introduces a new generation paradigm (diffusion), and (ii) an initial mechanistic analysis of where "repetition" lives inside PLM activations.

Extra models — autoregressive and diffusion

ProGen2-base (autoregressive)

Same baselines as ProtGPT2 (temperature, top-$p$, no-repeat-$n$-gram, repetition penalty). Across CATH / UniRef50 / SCOP, decoding heuristics barely move $R$ in the unconditional setting, while UCCS delivers consistent gains and is the only method satisfying the utility constraint across every condition. The latent repetition direction generalises to ProGen2's distinct architecture and training corpus.

DPLM-650M (diffusion)

A completely different generation mechanism. We sweep diffusion-specific knobs (sampling strategy, resample ratio, internal-resample on/off) — they trade $R$ and $U$ in unstable ways. UCCS still wins: $R = 0.863\text{–}0.879$ (unconditional) and $0.881\text{–}0.894$ (conditional), with $U$ improved over the original model. The method transfers across AR, MLM, and diffusion PLMs.

Where does repetition live? — initial mechanism analysis

Neuron-level correlation

Pearson correlation between each neuron's activation and the repetition score, across all layers. Distributions are centered near zero with no isolated high-correlation outliers — repetition is not driven by a small set of specialised units.

Takeaway. Repetition is a distributed representational pattern → aligns with a direction/subspace in activation space, not individual neurons. This is exactly the object UCCS isolates via the mean-difference vector.
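A minimal sketch of the per-neuron scan, assuming `acts` stores sequence-averaged activations of one layer as an `[n_sequences, n_neurons]` array and `rep_scores` the matching $R(x)$ values.

```python
import numpy as np

def neuron_repetition_correlation(acts: np.ndarray, rep_scores: np.ndarray) -> np.ndarray:
    """Pearson correlation of every neuron's activation with the repetition score."""
    a = acts - acts.mean(axis=0)
    r = rep_scores - rep_scores.mean()
    denom = a.std(axis=0) * r.std() * len(r)
    return (a * r[:, None]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
```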

[Figure: histograms of neuron–repetition correlations. ESM-3: broader, often bimodal. ProtGPT2: narrower, unimodal]

ESM-3's broader, more polarised distribution matches its empirically more severe repetition: architecture modulates the strength of the repetition subspace.

Layer-wise linear probes

Linear probes trained to predict the repetition score from hidden activations at each layer. Probe performance grows sharply in later layers — the model's late-layer geometry encodes repetition most explicitly.

Takeaway. The same layers that probes find most informative are the layers where UCCS injection works best (ESM-3 $L \approx 46$, ProtGPT2 $L \approx 30$). The layer ablation above is not a coincidence; it is consistent with the representational geometry.
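One way such probes could be set up, sketched with ridge regression and cross-validated $R^2$; the probe family and scoring metric are assumptions, not necessarily the paper's exact configuration. `layer_feats[l]` is assumed to hold pooled layer-$l$ representations as an `[n_sequences, d]` array.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def probe_layers(layer_feats: dict, rep_scores) -> dict:
    """Return {layer: mean cross-validated R^2 of a linear probe predicting R(x)}."""
    return {
        layer: cross_val_score(Ridge(alpha=1.0), feats, rep_scores,
                               cv=5, scoring="r2").mean()
        for layer, feats in layer_feats.items()
    }
```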

[Figure: layer-wise probe metrics, rising sharply in late layers. Panels: ESM-3; ProtGPT2 (same pattern, weaker magnitude)]

Full tables for ProGen2 / DPLM and detailed mechanism figures are in the paper appendix.

Limitations & open questions

Cite

@misc{zhang2026controllingrepetitionproteinlanguage,
  title={Controlling Repetition in Protein Language Models},
  author={Jiahao Zhang and Zeqing Zhang and Di Wang and Lijie Hu},
  year={2026},
  eprint={2602.00782},
  archivePrefix={arXiv},
  primaryClass={q-bio.BM},
  url={https://arxiv.org/abs/2602.00782},
}