ICLR 2026 · Poster

Controlling Repetition in Protein Language Models

Jiahao Zhang1,* Zeqing Zhang1,* Di Wang2 Lijie Hu1,†
1 MBZUAI  2 KAUST  * equal contribution  † corresponding author: lijie.hu@mbzuai.ac.ae

TL;DR

Protein language models frequently collapse into long homopolymer repeats (AAAAAA) and short motif loops (AGAGAG). These sequences do not fold. We quantify this failure and introduce UCCS (Utility-Controlled Contrastive Steering), a training-free method that reduces repetition while simultaneously raising AlphaFold confidence across ESM-3 and ProtGPT2.

01

Repetition is a structural failure

Motif-level and homopolymer collapse in PLMs tracks AlphaFold pLDDT, not just token diversity.

02

Decoding penalties are not enough

Temperature, top-p, and repetition-penalty either barely move repetition or lower foldability.

03

Utility-matched contrastive steering wins

UCCS is the only method that improves both repetition and utility on all tested datasets and models.

The problem: PLMs repeat, and repetition breaks folding

Raw ESM-3 / ProtGPT2 generations often collapse into low-complexity sequences that AlphaFold cannot confidently fold. Natural proteins (top row) vs. unmodified model outputs (bottom rows). Color = per-residue pLDDT (blue ≈ 100, orange/red < 50).

[Figure: 3 × 3 grid of AlphaFold-predicted structures, colored by per-residue pLDDT]
Natural proteins: CATH · 128 aa, 256 aa, 513 aa (high pLDDT throughout)
ESM-3 raw output: 126 aa, 256 aa, 512 aa (large orange/red low-pLDDT regions)
ProtGPT2 raw output: 128 aa, 256 aa, 511 aa

Repetition score R(x)

  • $H_{\mathrm{norm}}$ — normalized unigram entropy (global balance)
  • Distinct-2/3 — 2-/3-gram diversity (motif loops)
  • $R_{\mathrm{hpoly}}$ — homopolymer penalty (runs ≥ 4)

Higher R(x) = less repetitive.
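For concreteness, here is a minimal Python sketch of how these three components could be combined into a single $R(x)$. The equal weighting, the log-20 entropy normalization, and the run-length penalty below are illustrative assumptions rather than the paper's exact definition.

```python
# Hypothetical sketch of R(x); the component weighting is an assumption.
import math
from collections import Counter

def normalized_entropy(seq: str) -> float:
    """H_norm: unigram entropy normalized by log(20) (global balance)."""
    counts = Counter(seq)
    h = -sum((c / len(seq)) * math.log(c / len(seq)) for c in counts.values())
    return h / math.log(20)

def distinct_n(seq: str, n: int) -> float:
    """Distinct-n: fraction of unique n-grams (penalizes motif loops like AGAGAG)."""
    ngrams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def homopolymer_penalty(seq: str, min_run: int = 4) -> float:
    """R_hpoly: 1 minus the fraction of residues lying in runs of length >= 4."""
    in_runs, i = 0, 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_run:
            in_runs += j - i
        i = j
    return 1.0 - in_runs / len(seq)

def repetition_score(seq: str) -> float:
    """R(x): higher = less repetitive (simple average of the components)."""
    parts = [normalized_entropy(seq), distinct_n(seq, 2),
             distinct_n(seq, 3), homopolymer_penalty(seq)]
    return sum(parts) / len(parts)

print(repetition_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # diverse: close to 1
print(repetition_score("AGAGAGAGAGAGAGAAAAAAAAAAAAAAAAAAA"))  # repetitive: much lower
```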

Utility U(x)

AlphaFold-derived structural confidence:

$U(x) \;=\; \tfrac{1}{2}\bigl(\mathrm{pLDDT}(x) + \mathrm{pTM}(x)\bigr)$

Higher U(x) = more confident fold.

The core obstacle. On real datasets, low repetition and high utility are entangled. Naively steering "away from repetition" usually also pushes the model away from foldability, because the direction learned from raw contrastive data is a mixture of both.

Method — UCCS

UCCS builds contrastive sets that are matched in utility and separated in repetition, so the mean-difference vector isolates a pure "repetition direction" in hidden space.

[Figure: UCCS pipeline — score candidates by $(R, U)$, filter to a utility band, contrastive selection, steering vector $v^L$, inference-time injection]
(a) Score a candidate pool by $(R, U)$. (b) Filter to a utility band and maximise $\Delta R$ subject to $\Delta U \le \epsilon$ (Pareto / composite ranking). (c) Take the mean-difference vector $v^L = \mathbb{E}_{x \in \mathcal{D}^+}[\phi^L(x)] - \mathbb{E}_{x \in \mathcal{D}^-}[\phi^L(x)]$ at a chosen layer. (d) At inference, inject $\tilde h^L = h^L + \alpha \, v^L$.

Formulation

$$\min_{f}\; R\!\left(f(M, p)\right) \quad \text{s.t.}\quad U\!\left(f(M, p)\right) \;\ge\; U(M, p) - \epsilon$$

We do not retrain $M$. Instead we modify its forward pass by adding a fixed direction to hidden activations at a chosen layer.
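A minimal PyTorch sketch of that intervention, assuming a model whose transformer blocks expose standard forward hooks. The module path in the usage comment is hypothetical; the real path differs between ESM-3 and ProtGPT2.

```python
# Sketch: compute the mean-difference direction v^L and add alpha * v^L to the
# hidden states of one layer at inference time, via a forward hook (no retraining).
import torch

def steering_vector(pos_reps: torch.Tensor, neg_reps: torch.Tensor) -> torch.Tensor:
    """v^L = mean phi^L over the low-repetition set D+ minus the mean over D-."""
    return pos_reps.mean(dim=0) - neg_reps.mean(dim=0)

def add_steering_hook(layer_module: torch.nn.Module, v: torch.Tensor, alpha: float = 1.0):
    """Register a hook that replaces the layer's hidden-state output with h + alpha * v."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Hypothetical usage (module path is illustrative only):
# v_L = steering_vector(phi_plus, phi_minus)        # phi_plus: [|D+|, d] tensor
# handle = add_steering_hook(model.transformer.h[30], v_L, alpha=1.0)
# ... generate as usual ...
# handle.remove()                                    # restore the unmodified model
```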

Representation $\phi^L$

$$\phi^L(x) \;=\; \begin{cases} \tfrac{1}{T}\sum_{t=1}^T h^L_t(x) & \text{MLM (ESM-3, mean pool)}\\[4pt] h^L_T(x) & \text{AR-LM (ProtGPT2, last token)} \end{cases}$$

MLMs encode bidirectionally, so mean-pool; AR-LMs concentrate predictive information at the last token.
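A small sketch of the two pooling rules, assuming `hidden` holds the layer-$L$ states of a single sequence as a `[T, d]` tensor (e.g. obtained with `output_hidden_states=True` or a forward hook).

```python
import torch

def phi_mlm(hidden: torch.Tensor) -> torch.Tensor:
    """ESM-3 (masked LM): mean-pool over all T positions -> [d]."""
    return hidden.mean(dim=0)

def phi_ar(hidden: torch.Tensor) -> torch.Tensor:
    """ProtGPT2 (autoregressive LM): take the last-token state -> [d]."""
    return hidden[-1]
```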

Dataset construction

Pareto or composite selection over ~10 k candidates → ~100 matched pairs. Utility matching is what makes the steering direction clean.
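A greedy sketch of the utility-matched selection; the paper uses Pareto or composite ranking, so treat the band width `eps`, the pair budget, and the greedy pairing below as illustrative stand-ins.

```python
# Sketch: pick pairs that are close in utility U but far apart in repetition R,
# so the mean-difference vector isolates repetition rather than foldability.
def build_contrastive_sets(candidates, n_pairs=100, eps=0.02):
    """candidates: list of (seq, R, U) tuples. Returns (D_plus, D_minus)."""
    pool = sorted(candidates, key=lambda c: c[2])        # sort by utility U
    pairs = []
    for i, (seq_a, r_a, u_a) in enumerate(pool):
        for seq_b, r_b, u_b in pool[i + 1:]:
            if u_b - u_a > eps:                          # outside the utility band
                break
            pairs.append((abs(r_a - r_b), (seq_a, r_a), (seq_b, r_b)))
    pairs.sort(key=lambda p: p[0], reverse=True)         # largest R gap first
    d_plus, d_minus = [], []
    for _, a, b in pairs[:n_pairs]:
        hi, lo = (a, b) if a[1] > b[1] else (b, a)
        d_plus.append(hi[0])                             # low-repetition member (high R)
        d_minus.append(lo[0])                            # high-repetition member (low R)
    return d_plus, d_minus
```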

Strength α

Default $\alpha = 1$. Unimodal across $[0.5, 4]$. Larger $\alpha$ emphasises $R$; smaller $\alpha$ protects $U$.

Injection layer L

Late layers work best: ESM-3, $L \approx 46$ of 48; ProtGPT2, $L \approx 30$ of 36. Early-layer injection is inert or harmful.

Main results

Across CATH, UniRef50 and SCOP, on both ESM-3 (MLM) and ProtGPT2 (AR-LM), UCCS is the only method that improves both repetition $R$ and utility $U$ relative to the unmodified baseline.

[Figure: $R$ vs. $U$ scatter for ESM-3 on CATH, unconditional generation; UCCS sits in the top-right quadrant above all baselines]
ESM-3 · CATH · unconditional. UCCS (top-right) is the only method above both $R_{\mathrm{orig}}$ and $U_{\mathrm{orig}}$.
+52 % repetition score $R$ (0.423 → 0.645)
+13 % utility $U$ (0.576 → 0.652)

Temperature sampling raises $R$ to 0.751 but drops $U$ to 0.566 — the trade-off UCCS breaks.

Method                        R ↑      U ↑
Original (no intervention)    0.423    0.576
Temperature sampling          0.751    0.566
Top-p sampling                0.419    0.590 ✓
Entropy-based sampling        0.904    0.477
Neuron deactivation           0.551    0.507
Probe steering                0.415    0.587 ✓
UCCS (ours)                   0.645    0.652 ✓
ESM-3, CATH, unconditional. ✓ marks $U \ge U_{\mathrm{orig}}$.

Same pattern holds for UniRef50 and SCOP, and for conditional (prefix-10) generation — see the paper's Tables 1 and 2 for all 12 settings.

Ablations — the three knobs

A. Utility-matched dataset sampling

Pareto and composite selection beat random sampling — both lift $R$ and $U$ and shrink variance across seeds. Without utility matching, the mean-difference vector absorbs foldability signal too, and UCCS collapses to a decoding-penalty-like trade-off.

B. Steering strength α
[Figure: $\alpha$ ablation curves, unimodal with a peak near $\alpha \approx 1$–$2$. Panels: ESM-3, ProtGPT2]

Performance is clearly unimodal in $\alpha$ across $[0.5, 4]$, and the default $\alpha = 1$ gives the best $R$–$U$ harmonic mean. Conditional generation is noticeably more robust, with a wider flat top.
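A toy sketch of how the default could be picked from such a sweep, scoring each $\alpha$ by the harmonic mean of $R$ and $U$; `evaluate` is a hypothetical callback returning mean $(R, U)$ over generations at a given strength.

```python
def harmonic_mean(r: float, u: float) -> float:
    return 2 * r * u / (r + u) if (r + u) > 0 else 0.0

def pick_alpha(evaluate, grid=(0.5, 1.0, 1.5, 2.0, 3.0, 4.0)) -> float:
    """Return the alpha in the sweep with the best R-U harmonic mean."""
    return max(grid, key=lambda a: harmonic_mean(*evaluate(a)))
```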

C. Injection layer L
[Figure: injection-layer ablation, late layers win. Panels: ESM-3 (48 layers), ProtGPT2 (36 layers)]

Injection in early layers is inert or harmful; late-layer injection (ESM-3 $\approx$ L46 of 48; ProtGPT2 $\approx$ L30 of 36) improves monotonically up to the penultimate layer, consistent with repetition features concentrating late in depth.

Beyond the main paper

The appendix extends the study with (i) two additional protein language models, one of which introduces a new generation paradigm (diffusion), and (ii) an initial mechanistic analysis of where "repetition" lives inside PLM activations.

Extra models — autoregressive and diffusion

ProGen2-base (autoregressive)

Same baselines as ProtGPT2 (temperature, top-$p$, no-repeat-$n$-gram, repetition penalty). Across CATH / UniRef50 / SCOP, decoding heuristics barely move $R$ in the unconditional setting, while UCCS delivers consistent gains and is the only method satisfying the utility constraint across every condition. The latent repetition direction generalises to ProGen2's distinct architecture and training corpus.

DPLM-650M (diffusion)

A completely different generation mechanism. We sweep diffusion-specific knobs (sampling strategy, resample ratio, internal-resample on/off) — they trade $R$ and $U$ in unstable ways. UCCS still wins: $R = 0.863\text{–}0.879$ (unconditional) and $0.881\text{–}0.894$ (conditional), with $U$ improved over the original model. The method transfers across AR, MLM, and diffusion PLMs.

Where does repetition live? — initial mechanism analysis

Neuron-level correlation

Pearson correlation between each neuron's activation and the repetition score, across all layers. Distributions are centered near zero with no isolated high-correlation outliers — repetition is not driven by a small set of specialised units.

Takeaway. Repetition is a distributed representational pattern → aligns with a direction/subspace in activation space, not individual neurons. This is exactly the object UCCS isolates via the mean-difference vector.
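A minimal sketch of the per-neuron scan, assuming `acts` stores sequence-averaged activations of one layer as an `[n_sequences, n_neurons]` array and `rep_scores` the matching $R(x)$ values.

```python
import numpy as np

def neuron_repetition_correlation(acts: np.ndarray, rep_scores: np.ndarray) -> np.ndarray:
    """Pearson correlation of every neuron's activation with the repetition score."""
    a = acts - acts.mean(axis=0)
    r = rep_scores - rep_scores.mean()
    denom = a.std(axis=0) * r.std() * len(r)
    return (a * r[:, None]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
```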

[Figure: histograms of neuron–repetition correlations. ESM-3: broader, often bimodal. ProtGPT2: narrower, unimodal]

ESM-3's broader, more polarised distribution matches its empirically more severe repetition: architecture modulates the strength of the repetition subspace.

Layer-wise linear probes

Linear probes trained to predict the repetition score from hidden activations at each layer. Probe performance grows sharply in later layers — the model's late-layer geometry encodes repetition most explicitly.

Takeaway. The same layers that probes find most informative are the layers where UCCS injection works best (ESM-3 $L \approx 46$, ProtGPT2 $L \approx 30$). The layer ablation above is not a coincidence; it is consistent with the representational geometry.
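One way such probes could be set up, sketched with ridge regression and cross-validated $R^2$; the probe family and scoring metric are assumptions, not necessarily the paper's exact configuration. `layer_feats[l]` is assumed to hold pooled layer-$l$ representations as an `[n_sequences, d]` array.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def probe_layers(layer_feats: dict, rep_scores) -> dict:
    """Return {layer: mean cross-validated R^2 of a linear probe predicting R(x)}."""
    return {
        layer: cross_val_score(Ridge(alpha=1.0), feats, rep_scores,
                               cv=5, scoring="r2").mean()
        for layer, feats in layer_feats.items()
    }
```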

[Figure: layer-wise probe metrics, rising sharply in late layers. Panels: ESM-3; ProtGPT2 (same pattern, weaker magnitude)]

Full tables for ProGen2 / DPLM and detailed mechanism figures are in the paper appendix.

Limitations & open questions

Cite

@misc{zhang2026controllingrepetitionproteinlanguage,
  title={Controlling Repetition in Protein Language Models},
  author={Jiahao Zhang and Zeqing Zhang and Di Wang and Lijie Hu},
  year={2026},
  eprint={2602.00782},
  archivePrefix={arXiv},
  primaryClass={q-bio.BM},
  url={https://arxiv.org/abs/2602.00782},
}