01

Motivation · Repetition Collapses the Fold

Protein language models (PLMs) such as ESM-3 and ProtGPT2 generate sequences that collapse into homopolymers or short motif loops. Unlike repetition in natural-language text, these collapses destroy the predicted fold: AlphaFold pLDDT drops from above 85 to below 50 across repetitive regions.

(Figure: predicted structures colored by pLDDT. Panels: CATH natural protein, 256 aa; UniRef50 natural protein, 256 aa; ESM-3 generation collapsing into a homopolymer, 256 aa; ProtGPT2 generation with a motif loop, 128 aa.)

Blue = pLDDT > 90 (confident); orange/red = < 70 (low). Repetitive residues fold poorly.

02

Measuring Repetition · R(x) and U(x)

We unify three complementary signals into a single repetition score in [0, 1] (higher = less repetition):

H_norm · normalized token entropy; catches global collapse.
Distinct-n · fraction of unique n-grams (D₂, D₃ for n = 2, 3); catches motif loops.
R_hpoly · penalizes runs of ≥ 4 identical residues.

R(x) = ⅓ (H_norm + ½ (D₂ + D₃) + R_hpoly)  ·  U(x) = ½ (pLDDT / 100 + pTM)
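For concreteness, a minimal Python sketch of both scores. The helper names are ours, and details such as the entropy base, the run threshold k = 4, and how pLDDT/pTM are obtained are assumptions that may differ from the paper's implementation:

```python
import math
from collections import Counter

N_AA = 20  # standard amino-acid alphabet size

def h_norm(seq: str) -> float:
    """Normalized token entropy: 1.0 = uniform residue usage, 0.0 = homopolymer."""
    n = len(seq)
    ent = -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
    return ent / math.log2(N_AA)

def distinct_n(seq: str, n: int) -> float:
    """Fraction of unique n-grams; low values flag motif loops."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(grams)) / len(grams)

def r_hpoly(seq: str, k: int = 4) -> float:
    """1 minus the fraction of residues inside runs of >= k identical residues."""
    bad, i = 0, 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= k:
            bad += j - i
        i = j
    return 1.0 - bad / len(seq)

def repetition_score(seq: str) -> float:
    """R(x) in [0, 1]; higher = less repetition."""
    d = 0.5 * (distinct_n(seq, 2) + distinct_n(seq, 3))
    return (h_norm(seq) + d + r_hpoly(seq)) / 3.0

def utility_score(plddt: float, ptm: float) -> float:
    """U(x) = (pLDDT / 100 + pTM) / 2, scores from a structure predictor."""
    return 0.5 * (plddt / 100.0 + ptm)
```

Sanity check: a pure homopolymer scores R ≈ 0 (all three terms vanish), while a diverse natural sequence scores close to 1.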

Across ESM-3 and ProtGPT2 on CATH, UniRef50, and SCOP, low R reliably predicts low U: repetition is not cosmetic.

↑R · less repetition    ↑U · better fold
03

Why not just penalize repetition?

Every decoding penalty or mechanistic edit we tested either barely moves R or trades R for a large drop in U. No baseline improves both simultaneously.

Method family                 R     U     Jointly ↑?
Temperature / top-p           ↑     ↓     ✗
Entropy-based sampling        ↑↑    ↓↓    ✗
Repetition / n-gram penalty   –     –     ✗
Neuron deactivation           ↑     ↓     ✗
Probe steering                ≈     ≈     ✗
UCCS (ours)                   ↑     ↑     ✓
04

UCCS · Utility-Controlled Contrastive Steering

UCCS extracts a single steering vector from a utility-matched contrastive dataset and injects it at inference — no retraining, no decoding penalty.

(Figure: UCCS method overview. Utility-controlled contrastive pairs produce a layer-L steering vector that is injected at inference.)
STEP 1
Utility-matched pairs
Build 𝒟⁺ (low-repetition, high R) and 𝒟⁻ (repetitive, low R) so that ΔU ≤ ε while ΔR is maximized. Pareto / composite selection beats random; a sketch follows.
(𝒟⁺, 𝒟⁻) = argmax ΔR   s.t.   ΔU ≤ ε
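A minimal sketch of this selection step. The greedy utility-matching heuristic and all names here are ours; the paper's Pareto / composite criterion is the principled version:

```python
import numpy as np

def utility_matched_pools(R, U, k=128, pool=4):
    """Step 1 sketch: candidates for D+ are the least-repetitive (highest-R)
    sequences, candidates for D- the most-repetitive; keep the k pairs whose
    utilities match best, so the pools differ in R but not in U."""
    order = np.argsort(R)
    neg_cand = list(order[:pool * k])    # low R  -> candidate D-
    pos_cand = list(order[-pool * k:])   # high R -> candidate D+

    pairs = []  # greedy 1-to-1 matching on utility
    for j in neg_cand:
        i = min(pos_cand, key=lambda p: abs(U[p] - U[j]))
        pos_cand.remove(i)
        pairs.append((abs(U[i] - U[j]), i, j))
    pairs.sort(key=lambda t: t[0])       # tightest utility matches first

    d_pos = np.array([i for _, i, _ in pairs[:k]])
    d_neg = np.array([j for _, _, j in pairs[:k]])
    return d_pos, d_neg  # verify |U[d_pos].mean() - U[d_neg].mean()| <= eps
```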
STEP 2
Mean-difference vector
Pool per-layer representations (mean over tokens for masked LMs, last token for autoregressive LMs). One direction per layer:
v_L = 𝔼_{x∈𝒟⁺}[h_L(x)] − 𝔼_{x∈𝒟⁻}[h_L(x)]
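Assuming a HuggingFace-style model that returns hidden_states, extraction takes a few lines (function and argument names are ours):

```python
import torch

@torch.no_grad()
def steering_vector(model, tokenizer, d_pos, d_neg, layer, pool="mean"):
    """Step 2 sketch: v_L = E_{D+}[h_L(x)] - E_{D-}[h_L(x)].
    pool='mean' for masked LMs, pool='last' for autoregressive LMs."""
    def pooled_mean(seqs):
        reps = []
        for s in seqs:
            ids = tokenizer(s, return_tensors="pt")
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
            reps.append(h[0].mean(dim=0) if pool == "mean" else h[0, -1])
        return torch.stack(reps).mean(dim=0)
    return pooled_mean(d_pos) - pooled_mean(d_neg)
```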
STEP 3
Plug-and-play injection
Add α · v_L to the hidden states at inference. α ≈ 1 works across backbones.
h̃_t^L(x) = h_t^L(x) + α · v_L
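One way to realize the injection is a PyTorch forward hook on the chosen layer; this is a sketch under an assumed module layout, not the paper's code:

```python
def inject_steering(layer_module, v, alpha=1.0):
    """Step 3 sketch: add alpha * v_L to every hidden state the layer emits."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * v.to(device=h.device, dtype=h.dtype)
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return layer_module.register_forward_hook(hook)

# Usage (module path hypothetical):
# handle = inject_steering(model.encoder.layer[-3], v_L, alpha=1.0)
# ... generate as usual, then handle.remove()
```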
Why utility control matters. Without the ΔU ≤ ε constraint, v_L captures "natural vs. generated" rather than repetition; that is exactly why probe steering (the uncontrolled version) fails to improve R. On the 𝒟⁺/𝒟⁻ pools, the R metrics separate sharply while the U distributions align, confirming the disentanglement.
05

Main Results · ESM-3, unconditional, CATH

+52% R vs Original · +13% U vs Original · U ≥ U_orig held
Method                R ↑     U ↑     U ≥ U₀?
Original              0.423   0.576   (reference)
Temperature           0.751   0.566   ✗
Entropy sampling      0.904   0.477   ✗
Neuron deactivation   0.551   0.507   ✗
Probe steering        0.415   0.587   ✓
UCCS (ours)           0.645   0.652   ✓
(Figure: R vs. U scatter for ESM-3 unconditional generation on CATH, UniRef50, and SCOP. Series: UCCS, Original, Temperature, Top-p, Entropy, Neuron deact., Probe steer. UCCS dominates the Pareto frontier.)
06

Ablations · Dataset · α · Layer

(Figure: ablations. (a) Dataset sampling: Pareto and composite selection beat random. (b) α: the effect is unimodal, peaking at α ≈ 1–1.5. (c) Layer: late layers dominate; injecting in the last 3–6% of depth works best.)