UCSB + Apple collaborative research

Length Value Model

Scalable value pretraining for token-level length modeling.

Length Value Model, or LenVM, treats remaining generation length as a value estimation problem. It predicts a bounded token-level signal during decoding, enabling precise length control, smooth performance-efficiency trade-offs, prompt-boundary length prediction, and interpretable generation dynamics.

Annotation-free Dense Unbiased Scalable

LenVM architecture diagram showing the training pipeline with scalar head predicting values in the range of negative one to zero — Architecture & Training Pipeline

LenVM attaches a scalar head to the final hidden state. At each decoding step, the head predicts a value in \( (-1, 0) \), giving a token-level estimate of the remaining generation horizon.

Authors

Research team and affiliations

Zhen Zhang¹ Changyi Yang²,⁴ Zijie Xia⁴ Zhen Yang⁵ Chengzhi Liu¹ Zhaotiao Weng¹ Yepeng Liu¹ Haobo Chen¹ Jin Pan³,⁴ Chenyang Zhao⁴ Yuheng Bu¹ Alkesh Patel⁵ Zhe Gan⁵ Xin Eric Wang¹

¹ University of California, Santa Barbara ² Carnegie Mellon University ³ University of Wisconsin-Madison ⁴ LMSYS Org ⁵ Apple

Correspondence: zhen_zhang@ucsb.edu, ericxwang@ucsb.edu

Method

A small scalar head turns generation length into a controllable value signal.

Modeling length as a discounted return

LenVM starts from a constant negative reward at each non-terminal decoding step:

\[ r_t = -(1-\gamma), \qquad t = 0, \dots, L-1 \]

For a sampled completion of generated length \(L\), the discounted return from step \(t\) is:

\[ G_t \triangleq \sum_{i=0}^{L-t-1} \gamma^i r_{t+i} = -(1-\gamma^{L-t}) \]

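This turns the remaining generation length into a bounded, monotone target. For \(0 < \gamma < 1\), \(G_t\) lies in \((-1, 0)\) and moves toward \(-1\) as the remaining horizon grows, so the model does not have to regress directly on raw token counts.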
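As a quick check of the return formula, the sketch below computes the targets for one completion in two ways: from the closed form, and by summing the discounted per-step reward over the \(L - t\) remaining steps. The function names and the values `length=5` and `gamma=0.9` are illustrative choices, not settings taken from the paper.

```python
def discounted_returns(length: int, gamma: float) -> list[float]:
    """Closed-form targets G_t = -(1 - gamma**(length - t)) for t = 0..length-1."""
    if length < 1 or not 0.0 < gamma < 1.0:
        raise ValueError("need length >= 1 and 0 < gamma < 1")
    return [-(1.0 - gamma ** (length - t)) for t in range(length)]


def discounted_returns_by_sum(length: int, gamma: float) -> list[float]:
    """Same targets, accumulated term by term from the constant reward."""
    reward = -(1.0 - gamma)  # r_t at every non-terminal step
    return [
        sum(gamma ** i * reward for i in range(length - t))  # length - t terms
        for t in range(length)
    ]


# Illustrative values only: a 5-token completion with gamma = 0.9.
targets = discounted_returns(length=5, gamma=0.9)
by_sum = discounted_returns_by_sum(length=5, gamma=0.9)
```

Both lists agree up to floating-point error. The targets stay strictly inside \((-1, 0)\) and rise toward 0 as the end of the completion approaches.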

Value head and objective

A scalar value head is attached to the final-layer hidden state at each decoding step:

\[ z_t = W_2 \, \mathrm{SiLU}(W_1 h_t + b_1) + b_2 \]

\[ V_\theta(s_t) = -\sigma(z_t) \]
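The two equations above can be sketched in plain Python for a single hidden state. This is a minimal illustration, not the released implementation: the layer sizes, the random initialization, and the names `value_head`, `d_model`, and `d_hidden` are assumptions made for the example.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def value_head(h, W1, b1, W2, b2):
    """Return V = -sigmoid(W2 . SiLU(W1 h + b1) + b2) for one hidden state h."""
    hidden = [
        silu(sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(W1, b1)
    ]
    z = sum(w * a for w, a in zip(W2, hidden)) + b2
    return -sigmoid(z)


rng = random.Random(0)
d_model, d_hidden = 8, 4  # toy sizes, not taken from the paper
W1 = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [rng.gauss(0, 0.5) for _ in range(d_hidden)]
b2 = 0.0

h_t = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for a final-layer state
v_t = value_head(h_t, W1, b1, W2, b2)
```

Because the output is the negated sigmoid of \(z_t\), every prediction falls in \((-1, 0)\), the same range as the return targets, regardless of the weights.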

Training uses token-averaged mean squared error over sampled prompt-completion trajectories:

\[ \mathcal{L}_{\mathrm{len}} = \frac{ \sum_{n=1}^{N}\sum_{t=0}^{L^{(n)}-1} \left(V_\theta(s_t^{(n)}) - G_t^{(n)}\right)^2 }{ \sum_{n=1}^{N} L^{(n)} } \]

Supervision is dense because every non-terminal token contributes a target.
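A small sketch of the token-averaged objective, assuming predictions and targets are given as one list per trajectory. The function name `token_averaged_mse`, the constant prediction of -0.5, and `gamma = 0.9` are placeholders for illustration.

```python
def token_averaged_mse(predictions, targets):
    """Sum squared errors over all tokens, then divide by the total token count."""
    total_sq, total_tokens = 0.0, 0
    for pred_seq, tgt_seq in zip(predictions, targets):
        if len(pred_seq) != len(tgt_seq):
            raise ValueError("each trajectory needs one prediction per target")
        total_sq += sum((p - g) ** 2 for p, g in zip(pred_seq, tgt_seq))
        total_tokens += len(tgt_seq)
    return total_sq / total_tokens


gamma = 0.9  # illustrative discount, not the paper's setting
lengths = (2, 4)  # two toy completions
targets = [[-(1.0 - gamma ** (L - t)) for t in range(L)] for L in lengths]
predictions = [[-0.5] * L for L in lengths]  # placeholder predictions

loss = token_averaged_mse(predictions, targets)
perfect = token_averaged_mse(targets, targets)  # 0.0 when predictions match
```

Dividing by the total number of tokens rather than the number of trajectories means each token position carries equal weight, so longer completions contribute proportionally more terms.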

Annotation-free

Targets are computed directly from observed completion lengths, without any extra human labels or reward models.

Dense

Every token position contributes supervision, rather than producing only one target for an entire response.

Exact

Rewards and returns are deterministically computed from the realized trajectory, avoiding additional annotation noise.

Scalable

Supervision grows naturally with more prompts and more sampled completions per prompt, and scales with larger models.

Results

The results span several settings rather than a single headline number.

LenVM consistently improves exact length matching, exposes a better performance-efficiency frontier, predicts generation horizon from the prompt boundary, and scales cleanly as supervision grows.

Length Control

Open models become far more precise on exact target lengths.

On the LIFEBench Equal To task, LenVM raises the length score of Qwen2.5-7B-Instruct from 30.9 to 64.8.

Trade-Off

Shorter successful trajectories can be uncovered without changing the base model.

On GSM8K at a budget of around 200 tokens, LenVM keeps about 63% Pass@1, while the hard-budget baseline reaches only about 6%.

Prediction

Prompt-boundary length estimation becomes accurate enough to be operationally useful.

At 32B, mean relative error falls to 9.8% on math, 14.9% on code, and 17.1% on instruction following.

Training Mixture

Datasets used to train general LenVMs

Domain	Dataset	Scale
Code	OpenCodeReasoning-2 (Python)	1.42M
Instruction Following	WildChat	529k
Math	DeepMath-103K	103k

LIFEBench

Length-controlled generation under Equal To, At Most, and At Least constraints

Model	Equal To Deviation ↓	Equal To Score ↑	At Most Score ↑	At Least Score ↑
Closed-source frontier models
GPT-4o	74%	35.5	77.9	98.5
GPT-5.4	135%	37.4	65.4	98.9
GPT-5.4-thinking	131%	47.8	72.7	98.9
Claude-Sonnet-4-6	105%	34.1	62.9	100.0
Claude-Sonnet-4-6-thinking	124%	51.3	69.3	100.0
Claude-Opus-4-6	66%	35.5	51.5	100.0
Claude-Opus-4-6-thinking	87%	53.2	67.4	100.0
Gemini-3-Flash-Preview	123%	40.3	57.3	99.6
Gemini-3.1-Pro-Preview	91%	49.3	70.7	100.0
Open models with LenVM guidance
Qwen2.5-3B-Instruct	83%	25.6	92.1	94.6
+ LenVM (1.5B)	56%	62.6	93.0	93.1
Qwen2.5-7B-Instruct	71%	30.9	98.5	89.1
+ LenVM (1.5B)	44%	64.8	96.1	99.5
Qwen3-30B-A3B-Instruct	90%	36.8	87.0	99.3
+ LenVM (1.7B)	57%	67.2	99.4	99.8

GSM8K performance-efficiency trade-off curve comparing LenVM-guided decoding versus hard truncation — GSM8K: LenVM-guided decoding traces a stronger frontier than hard truncation.

MATH500 trade-off curve showing shorter responses preserving task performance — MATH500: shorter responses can preserve far more task performance.

MathVista trade-off curve demonstrating value signal transfer to VLM setting — MathVista: the same value signal transfers to a VLM setting.

Prompt-Boundary Prediction

Mean Relative Error on length estimation

Model Size	Math ↓	Code ↓	IF ↓
1.5B	17.0%	29.0%	33.0%
3B	13.6%	24.0%	27.2%
7B	11.0%	19.5%	23.0%
14B	10.4%	17.0%	19.8%
32B	9.8%	14.9%	17.1%
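One way to read a predicted value as a token count is to invert the closed-form return: if \(V = -(1-\gamma^{L})\), then \(L = \ln(1+V)/\ln\gamma\). The sketch below applies that inversion and computes a mean relative error, assumed here to be the mean of \(|\hat{L} - L| / L\). The discount `gamma = 0.999`, the example lengths, and the function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def value_to_length(value: float, gamma: float) -> float:
    """Invert V = -(1 - gamma**L) to recover L = log(1 + V) / log(gamma)."""
    if not -1.0 < value < 0.0:
        raise ValueError("value must lie strictly between -1 and 0")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return math.log1p(value) / math.log(gamma)


def mean_relative_error(predicted, actual):
    """Mean of |predicted - actual| / actual over paired lengths."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


gamma = 0.999  # hypothetical discount factor
true_lengths = [120, 480, 900]  # made-up completion lengths
exact_values = [-(1.0 - gamma ** L) for L in true_lengths]
recovered = [value_to_length(v, gamma) for v in exact_values]

# Made-up predictions, only to show the metric: (10/100 + 100/400) / 2
example_mre = mean_relative_error([110, 500], [100, 400])
```

With exact values the inversion recovers the original lengths up to floating-point error, so any relative error in practice comes from the value prediction itself.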

Analysis

LenVM is useful not only for control, but also for reading generation dynamics.

Word cloud visualization of positive length tokens including wait, think, try, and consider — Positive length tokens tend to mark longer-horizon shifts such as "wait", "think", "try", "Ah", and "consider".

Word cloud visualization of negative length tokens associated with closure and finalization — Negative length tokens often align with closure and answer finalization.

Beyond prediction and control, LenVM also offers a qualitative view of where generation shifts toward longer or shorter continuations. We analyze tokens that repeatedly co-occur with upward or downward changes in LenVM's predicted remaining horizon, and refer to them as length tokens. Using the one-step TD-style score \( s_t = r_{t-1} + \gamma V_t - V_{t-1} \), positive values indicate a shift toward a longer expected continuation, while negative values indicate a shift toward a shorter one.

Positive length tokens often resemble local reasoning pivots such as "ah", "but", "now", "wait", "let", "think", "try", and "consider". By contrast, negative length tokens are more often associated with closure, confirmation, or answer finalization. This analysis is descriptive rather than causal, but it shows that LenVM provides a simple token-level lens on generation dynamics.
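The score itself is simple to compute from a sequence of per-token values. The sketch below uses the constant reward \(-(1-\gamma)\) from the Method section; the function name `td_scores`, `gamma = 0.9`, and the hand-made perturbation are assumptions for illustration, not data from the paper.

```python
def td_scores(values, gamma):
    """Return s_t = r_{t-1} + gamma * V_t - V_{t-1} for t = 1..len(values)-1."""
    reward = -(1.0 - gamma)  # constant per-step reward
    return [
        reward + gamma * values[t] - values[t - 1]
        for t in range(1, len(values))
    ]


gamma = 0.9  # illustrative discount
length = 6   # toy completion length
exact = [-(1.0 - gamma ** (length - t)) for t in range(length)]

# Values equal to the exact returns are Bellman-consistent: every score is ~0.
baseline_scores = td_scores(exact, gamma)

# Hypothetical prediction that departs from the exact return at position 3.
perturbed = list(exact)
perturbed[3] += 0.05
perturbed_scores = td_scores(perturbed, gamma)
```

When the values follow the exact returns, all scores vanish. In the perturbed case, nonzero scores appear only at the step entering position 3 and the step leaving it, which is why the score localizes changes in the predicted horizon to individual tokens.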

Scalability

Value pretraining quality improves predictably as scale increases.

One of the main claims of the paper is that length supervision is naturally scalable. Validation loss improves not only with larger backbone models, but also with more training questions and more sampled completions per question, showing that the objective benefits from both data breadth and trajectory density.

Graph showing validation loss decreasing as model size increases — Validation loss drops with larger models.

Graph demonstrating improvement in value objective with more training questions — More training questions improve the value objective.

Graph showing better supervision quality with more sampled completions per prompt — More sampled completions per prompt provide more useful supervision.

Ablations

Design choices matter, but the framing remains stable.

Ablation study showing shuffle batching outperforming grouped batching — Shuffling beats grouped batching.

Ablation study demonstrating impact of numerical precision on prediction quality — Numerical precision has visible impact on prediction quality.

Paper

Citation and full PDF in one place.

BibTeX

@misc{zhang2026lengthvaluemodelscalable,
      title={Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling}, 
      author={Zhen Zhang and Changyi Yang and Zijie Xia and Zhen Yang and Chengzhi Liu and Zhaotiao Weng and Yepeng Liu and Haobo Chen and Jin Pan and Chenyang Zhao and Yuheng Bu and Alkesh Patel and Zhe Gan and Xin Eric Wang},
      year={2026},
      eprint={2604.27039},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27039}, 
}
