01

Motivation · Repetition Collapses the Fold

Protein language models (PLMs) such as ESM-3 and ProtGPT2 generate sequences that collapse into homopolymers or short motif loops. Unlike repetition in natural-language text, these collapses destroy the predicted fold: AlphaFold pLDDT drops from above 85 to below 50 across repetitive regions.

(Figure: predicted structures colored by pLDDT. Panels: CATH natural protein, 256 aa; UniRef50 natural protein, 256 aa; ESM-3 generation collapsing into a homopolymer, 256 aa; ProtGPT2 generation with a motif loop, 128 aa.)

Blue = pLDDT > 90 (confident); orange/red = < 70 (low). Repetitive residues fold poorly.

02

Measuring Repetition · R(x) and U(x)

We unify three complementary signals into a single repetition score in [0, 1] (higher = less repetition):

H_norm · normalized token entropy; catches global collapse.
Distinct-n · fraction of unique n-grams (D₂, D₃ for n = 2, 3); catches motif loops.
R_hpoly · penalizes runs of ≥ 4 identical residues.

R(x) = ⅓ (H_norm + ½ (D₂ + D₃) + R_hpoly)  ·  U(x) = ½ (pLDDT / 100 + pTM)
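For concreteness, a minimal Python sketch of both scores. The helper names are ours, and details such as the entropy base, the run threshold k = 4, and how pLDDT/pTM are obtained are assumptions that may differ from the paper's implementation:

```python
import math
from collections import Counter

N_AA = 20  # standard amino-acid alphabet size

def h_norm(seq: str) -> float:
    """Normalized token entropy: 1.0 = uniform residue usage, 0.0 = homopolymer."""
    n = len(seq)
    ent = -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
    return ent / math.log2(N_AA)

def distinct_n(seq: str, n: int) -> float:
    """Fraction of unique n-grams; low values flag motif loops."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(grams)) / len(grams)

def r_hpoly(seq: str, k: int = 4) -> float:
    """1 minus the fraction of residues inside runs of >= k identical residues."""
    bad, i = 0, 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= k:
            bad += j - i
        i = j
    return 1.0 - bad / len(seq)

def repetition_score(seq: str) -> float:
    """R(x) in [0, 1]; higher = less repetition."""
    d = 0.5 * (distinct_n(seq, 2) + distinct_n(seq, 3))
    return (h_norm(seq) + d + r_hpoly(seq)) / 3.0

def utility_score(plddt: float, ptm: float) -> float:
    """U(x) = (pLDDT / 100 + pTM) / 2, scores from a structure predictor."""
    return 0.5 * (plddt / 100.0 + ptm)
```

Sanity check: a pure homopolymer scores R ≈ 0 (all three terms vanish), while a diverse natural sequence scores close to 1.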

Across ESM-3 and ProtGPT2 on CATH, UniRef50, and SCOP, low R reliably predicts low U: repetition is not cosmetic.

↑R · less repetition    ↑U · better fold
03

Why not just penalize repetition?

Every decoding penalty or mechanistic edit we tested either barely moves R or trades R for a large drop in U. No baseline improves both simultaneously.

Method family                 R     U     Jointly ↑?
Temperature / top-p           ↑     ↓     ✗
Entropy-based sampling        ↑↑    ↓↓    ✗
Repetition / n-gram penalty   –     –     ✗
Neuron deactivation           ↑     ↓     ✗
Probe steering                ≈     ≈     ✗
UCCS (ours)                   ↑     ↑     ✓
04

UCCS · Utility-Controlled Contrastive Steering

UCCS extracts a single steering vector from a utility-matched contrastive dataset and injects it at inference — no retraining, no decoding penalty.

(Figure: UCCS method overview. Utility-controlled contrastive pairs produce a layer-L steering vector that is injected at inference.)
STEP 1
Utility-matched pairs
Build 𝒟⁺ (low-repetition, high R) and 𝒟⁻ (repetitive, low R) so that ΔU ≤ ε while ΔR is maximized. Pareto / composite selection beats random; a sketch follows.
(𝒟⁺, 𝒟⁻) = argmax ΔR   s.t.   ΔU ≤ ε
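A minimal sketch of this selection step. The greedy utility-matching heuristic and all names here are ours; the paper's Pareto / composite criterion is the principled version:

```python
import numpy as np

def utility_matched_pools(R, U, k=128, pool=4):
    """Step 1 sketch: candidates for D+ are the least-repetitive (highest-R)
    sequences, candidates for D- the most-repetitive; keep the k pairs whose
    utilities match best, so the pools differ in R but not in U."""
    order = np.argsort(R)
    neg_cand = list(order[:pool * k])    # low R  -> candidate D-
    pos_cand = list(order[-pool * k:])   # high R -> candidate D+

    pairs = []  # greedy 1-to-1 matching on utility
    for j in neg_cand:
        i = min(pos_cand, key=lambda p: abs(U[p] - U[j]))
        pos_cand.remove(i)
        pairs.append((abs(U[i] - U[j]), i, j))
    pairs.sort(key=lambda t: t[0])       # tightest utility matches first

    d_pos = np.array([i for _, i, _ in pairs[:k]])
    d_neg = np.array([j for _, _, j in pairs[:k]])
    return d_pos, d_neg  # verify |U[d_pos].mean() - U[d_neg].mean()| <= eps
```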
STEP 2
Mean-difference vector
Pool per-layer representations (mean over tokens for masked LMs, last token for autoregressive LMs). One direction per layer:
v_L = 𝔼_{x∈𝒟⁺}[h_L(x)] − 𝔼_{x∈𝒟⁻}[h_L(x)]
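Assuming a HuggingFace-style model that returns hidden_states, extraction takes a few lines (function and argument names are ours):

```python
import torch

@torch.no_grad()
def steering_vector(model, tokenizer, d_pos, d_neg, layer, pool="mean"):
    """Step 2 sketch: v_L = E_{D+}[h_L(x)] - E_{D-}[h_L(x)].
    pool='mean' for masked LMs, pool='last' for autoregressive LMs."""
    def pooled_mean(seqs):
        reps = []
        for s in seqs:
            ids = tokenizer(s, return_tensors="pt")
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
            reps.append(h[0].mean(dim=0) if pool == "mean" else h[0, -1])
        return torch.stack(reps).mean(dim=0)
    return pooled_mean(d_pos) - pooled_mean(d_neg)
```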
STEP 3
Plug-and-play injection
Add α · v_L to the hidden states at inference. α ≈ 1 works across backbones.
h̃_t^L(x) = h_t^L(x) + α · v_L
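One way to realize the injection is a PyTorch forward hook on the chosen layer; this is a sketch under an assumed module layout, not the paper's code:

```python
def inject_steering(layer_module, v, alpha=1.0):
    """Step 3 sketch: add alpha * v_L to every hidden state the layer emits."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * v.to(device=h.device, dtype=h.dtype)
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return layer_module.register_forward_hook(hook)

# Usage (module path hypothetical):
# handle = inject_steering(model.encoder.layer[-3], v_L, alpha=1.0)
# ... generate as usual, then handle.remove()
```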
Why utility control matters. Without the ΔU ≤ ε constraint, v_L captures "natural vs. generated" rather than repetition; that is exactly why probe steering (the uncontrolled version) fails to improve R. On the 𝒟⁺/𝒟⁻ pools, the R metrics separate sharply while the U distributions align, confirming the disentanglement.
05

Main Results · ESM-3, unconditional, CATH

+52% R vs Original · +13% U vs Original · U ≥ U_orig held
Method                R ↑     U ↑     U ≥ U₀?
Original              0.423   0.576   (reference)
Temperature           0.751   0.566   ✗
Entropy sampling      0.904   0.477   ✗
Neuron deactivation   0.551   0.507   ✗
Probe steering        0.415   0.587   ✓
UCCS (ours)           0.645   0.652   ✓
(Figure: R vs. U scatter for ESM-3 unconditional generation on CATH, UniRef50, and SCOP. Series: UCCS, Original, Temperature, Top-p, Entropy, Neuron deact., Probe steer. UCCS dominates the Pareto frontier.)
06

Ablations · Dataset · α · Layer

(Figure: ablations. (a) Dataset sampling: Pareto and composite selection beat random. (b) α: the effect is unimodal, peaking at α ≈ 1–1.5. (c) Layer: late layers dominate; injecting in the last 3–6% of depth works best.)