UCSB + Apple collaborative research

Length Value Model

Scalable value pretraining for token-level length modeling.

Length Value Model, or LenVM, treats remaining generation length as a value estimation problem. It predicts a bounded token-level signal during decoding, enabling precise length control, smooth performance-efficiency trade-offs, prompt-boundary length prediction, and interpretable generation dynamics.

Annotation-free · Dense · Unbiased · Scalable
LenVM architecture diagram showing the training pipeline with scalar head predicting values in the range of negative one to zero

Architecture & Training Pipeline

At each decoding step, LenVM attaches a scalar head to the final hidden state and predicts a value in \( (-1, 0) \), giving a token-level estimate of the remaining generation horizon.

Authors

Research team and affiliations

Zhen Zhang (1), Changyi Yang (2,4), Zijie Xia (4), Zhen Yang (5), Chengzhi Liu (1), Zhaotiao Weng (1), Yepeng Liu (1), Haobo Chen (1), Jin Pan (3,4), Chenyang Zhao (4), Yuheng Bu (1), Alkesh Patel (5), Zhe Gan (5), Xin Eric Wang (1)

(1) University of California, Santa Barbara · (2) Carnegie Mellon University · (3) University of Wisconsin-Madison · (4) LMSYS Org · (5) Apple

Method

A small scalar head turns generation length into a controllable value signal.

Modeling length as a discounted return

LenVM starts from a constant negative reward at each non-terminal decoding step:

\[ r_t = -(1-\gamma), \qquad t = 0, \dots, L-1 \]

For a sampled completion of generated length \(L\), the discounted return from step \(t\) is:

\[ G_t \triangleq \sum_{i=0}^{L-t-1} \gamma^i r_{t+i} = -(1-\gamma^{L-t}) \]

This makes the discounted return a bounded, monotone proxy of the remaining generation horizon, rather than a direct regression target on raw token count.
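The return definition above can be sketched in a few lines. The discount factor below is an illustrative assumption, not a value from the paper; the check confirms the closed form equals the explicit discounted sum of constant rewards.

```python
def length_targets(L, gamma=0.999):
    """Discounted-return targets G_t = -(1 - gamma^(L-t)) for t = 0..L-1,
    the closed form of summing the constant reward r = -(1 - gamma)."""
    return [-(1.0 - gamma ** (L - t)) for t in range(L)]

# Sanity check: closed form vs. explicit discounted sum of rewards.
L, gamma = 5, 0.9
r = -(1.0 - gamma)
explicit = [sum(gamma ** i * r for i in range(L - t)) for t in range(L)]
targets = length_targets(L, gamma)
assert all(abs(a - b) < 1e-12 for a, b in zip(explicit, targets))
```

Note the targets are bounded in \((-1, 0)\) and increase monotonically toward \(-(1-\gamma)\) as the remaining horizon shrinks, which is what makes them a stable regression signal.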

Value head and objective

A scalar value head is attached to the final-layer hidden state at each decoding step:

\[ z_t = W_2 \, \mathrm{SiLU}(W_1 h_t + b_1) + b_2 \]
\[ V_\theta(s_t) = -\sigma(z_t) \]

Training uses token-averaged mean squared error over sampled prompt-completion trajectories:

\[ \mathcal{L}_{\mathrm{len}} = \frac{ \sum_{n=1}^{N}\sum_{t=0}^{L^{(n)}-1} \left(V_\theta(s_t^{(n)}) - G_t^{(n)}\right)^2 }{ \sum_{n=1}^{N} L^{(n)} } \]

Supervision is dense because every non-terminal token contributes a target.
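The value head and token-averaged objective above can be sketched in NumPy. The hidden width, weight shapes, and random inputs are illustrative assumptions; only the functional form (SiLU MLP, negated sigmoid output, token-count-normalized squared error) follows the equations.

```python
import numpy as np

def value_head(H, W1, b1, W2, b2):
    """Scalar value head: V(s_t) = -sigmoid(W2 SiLU(W1 h_t + b1) + b2).
    H: (T, d) final-layer hidden states; returns (T,) values in (-1, 0)."""
    a = H @ W1.T + b1
    a = a / (1.0 + np.exp(-a))               # SiLU(x) = x * sigmoid(x)
    z = a @ W2.T + b2                        # (T, 1) logits
    return -1.0 / (1.0 + np.exp(-z[:, 0]))   # -sigmoid(z), bounded in (-1, 0)

def token_averaged_mse(values_per_traj, targets_per_traj):
    """Sum of squared errors over all tokens of all trajectories,
    divided by the total token count (token-averaged MSE)."""
    num = sum(((v - g) ** 2).sum()
              for v, g in zip(values_per_traj, targets_per_traj))
    den = sum(len(v) for v in values_per_traj)
    return num / den

rng = np.random.default_rng(0)
d, hidden = 8, 16                            # illustrative sizes
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(1, hidden)), np.zeros(1)
V = value_head(rng.normal(size=(5, d)), W1, b1, W2, b2)
assert np.all((V > -1.0) & (V < 0.0))
```

Because the output passes through a negated sigmoid, predictions stay inside the same \((-1, 0)\) interval as the targets by construction, so no clipping is needed.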

Annotation-free

Targets are computed directly from observed completion lengths, without any extra human labels or reward models.

Dense

Every token position contributes supervision, rather than producing only one target for an entire response.

Exact

Rewards and returns are deterministically computed from the realized trajectory, avoiding additional annotation noise.

Scalable

Supervision grows naturally with more prompts and more sampled completions per prompt, and scales with larger models.

Results

The results do not hinge on one headline number; the picture is broader.

LenVM consistently improves exact length matching, exposes a better performance-efficiency frontier, predicts generation horizon from the prompt boundary, and scales cleanly as supervision grows.

Length Control

Open models become far more precise on exact target lengths.

On LIFEBench Equal To, LenVM lifts Qwen2.5-7B-Instruct from 30.9 to 64.8 length score.

Trade-Off

Shorter successful trajectories can be uncovered without changing the base model.

At a budget of roughly 200 tokens on GSM8K, LenVM keeps about 63% Pass@1 while the hard-truncation baseline falls to about 6%.

Prediction

Prompt-boundary length estimation becomes accurate enough to be operationally useful.

At 32B, mean relative error falls to 9.8% on math, 14.9% on code, and 17.1% on instruction following.
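Prompt-boundary prediction follows directly from the return definition in the Method section: solving \(V_0 = -(1-\gamma^{\hat L})\) for \(\hat L\) gives a length estimate from the value at step 0. The inversion formula is derived here rather than quoted from the paper, and the discount factor is an illustrative assumption.

```python
import math

def predict_length(v0, gamma=0.999):
    """Invert V_0 = -(1 - gamma^L) to a remaining-length estimate:
    L = log(1 + V_0) / log(gamma), valid for V_0 in (-1, 0)."""
    return math.log(1.0 + v0) / math.log(gamma)

# Round-trip: the value implied by a 200-token horizon maps back to 200.
gamma = 0.999
v0 = -(1.0 - gamma ** 200)
assert abs(predict_length(v0, gamma) - 200) < 1e-6
```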

Training Mixture

Datasets used to train general LenVMs

Domain Dataset Scale
Code OpenCodeReasoning-2 (Python) 1.42M
Instruction Following WildChat 529k
Math DeepMath-103K 103k

LIFEBench

Length-controlled generation under Equal To, At Most, and At Least constraints

Model                            Equal To Deviation ↓   Equal To Score ↑   At Most Score ↑   At Least Score ↑

Closed-source frontier models
GPT-4o                           74%     35.5    77.9    98.5
GPT-5.4                          135%    37.4    65.4    98.9
GPT-5.4-thinking                 131%    47.8    72.7    98.9
Claude-Sonnet-4-6                105%    34.1    62.9   100.0
Claude-Sonnet-4-6-thinking       124%    51.3    69.3   100.0
Claude-Opus-4-6                  66%     35.5    51.5   100.0
Claude-Opus-4-6-thinking         87%     53.2    67.4   100.0
Gemini-3-Flash-Preview           123%    40.3    57.3    99.6
Gemini-3.1-Pro-Preview           91%     49.3    70.7   100.0

Open models with LenVM guidance
Qwen2.5-3B-Instruct              83%     25.6    92.1    94.6
+ LenVM (1.5B)                   56%     62.6    93.0    93.1
Qwen2.5-7B-Instruct              71%     30.9    98.5    89.1
+ LenVM (1.5B)                   44%     64.8    96.1    99.5
Qwen3-30B-A3B-Instruct           90%     36.8    87.0    99.3
+ LenVM (1.7B)                   57%     67.2    99.4    99.8
GSM8K performance-efficiency trade-off curve comparing LenVM-guided decoding versus hard truncation
GSM8K: LenVM-guided decoding traces a stronger frontier than hard truncation.
MATH500 trade-off curve showing shorter responses preserving task performance
MATH500: shorter responses can preserve far more task performance.
MathVista trade-off curve demonstrating value signal transfer to VLM setting
MathVista: the same value signal transfers to a VLM setting.

Prompt-Boundary Prediction

Mean Relative Error on length estimation

Model Size   Math ↓   Code ↓   IF ↓
1.5B         17.0%    29.0%    33.0%
3B           13.6%    24.0%    27.2%
7B           11.0%    19.5%    23.0%
14B          10.4%    17.0%    19.8%
32B           9.8%    14.9%    17.1%

Analysis

LenVM is useful not only for control, but also for reading generation dynamics.

Word cloud visualization of positive length tokens including wait, think, try, and consider
Positive length tokens tend to mark longer-horizon shifts such as "wait", "think", "try", "Ah", and "consider".
Word cloud visualization of negative length tokens associated with closure and finalization
Negative length tokens often align with closure and answer finalization.

Beyond prediction and control, LenVM also offers a qualitative view of where generation shifts toward longer or shorter continuations. We analyze tokens that repeatedly co-occur with upward or downward changes in LenVM's predicted remaining horizon, and refer to them as length tokens. Using the one-step TD-style score \( s_t = r_{t-1} + \gamma V_t - V_{t-1} \), positive values indicate a shift toward a longer expected continuation, while negative values indicate a shift toward a shorter one.

Positive length tokens often resemble local reasoning pivots such as "ah", "but", "now", "wait", "let", "think", "try", and "consider". By contrast, negative length tokens are more often associated with closure, confirmation, or answer finalization. This analysis is descriptive rather than causal, but it shows that LenVM provides a simple token-level lens on generation dynamics.
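The TD-style score above reduces to a short function over a trace of predicted values. A minimal sketch, with an illustrative discount factor: on a trace that exactly matches the targets \(V_t = -(1-\gamma^{L-t})\), every score is zero, so nonzero scores isolate where the model's expected horizon actually shifts.

```python
def length_token_scores(values, gamma=0.999):
    """One-step TD-style scores s_t = r_{t-1} + gamma * V_t - V_{t-1},
    with the constant reward r = -(1 - gamma). Positive s_t marks a shift
    toward a longer expected continuation, negative toward a shorter one."""
    r = -(1.0 - gamma)
    return [r + gamma * values[t] - values[t - 1]
            for t in range(1, len(values))]

# On exact targets V_t = -(1 - gamma^(L-t)) every score vanishes;
# real decoding traces deviate, and those deviations flag length tokens.
gamma, L = 0.99, 10
V = [-(1.0 - gamma ** (L - t)) for t in range(L)]
scores = length_token_scores(V, gamma)
assert all(abs(s) < 1e-12 for s in scores)
```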

Scalability

Value pretraining quality improves predictably as scale increases.

One of the main claims of the paper is that length supervision is naturally scalable. Validation loss improves not only with larger backbone models, but also with more training questions and more sampled completions per question, showing that the objective benefits from both data breadth and trajectory density.

Graph showing validation loss decreasing as model size increases
Validation loss drops with larger models.
Graph demonstrating improvement in value objective with more training questions
More training questions improve the value objective.
Graph showing better supervision quality with more sampled completions per prompt
More sampled completions per prompt provide more useful supervision.

Ablations

Design choices matter, but the framing remains stable.

Ablation study comparing length-space representation against alternative formulations
Length-space representation is the strongest target formulation.
Ablation study showing shuffle batching outperforming grouped batching
Shuffling beats grouped batching.
Ablation study demonstrating impact of numerical precision on prediction quality
Numerical precision has visible impact on prediction quality.

Paper

Citation and full PDF in one place.

BibTeX

@article{zhang2026lengthvaluemodel,
  title   = {Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling},
  author  = {Zhen Zhang and Changyi Yang and Zijie Xia and Zhen Yang and Chengzhi Liu and Zhaotiao Weng and Yepeng Liu and Haobo Chen and Jin Pan and Chenyang Zhao and Yuheng Bu and Alkesh Patel and Zhe Gan and Xin Eric Wang},
  year    = {2026},
  journal = {arXiv preprint arXiv}
}