Scoopfeeds — Intelligent news, curated.
How Far Apart Does a Model Think Its Tokens Are?
agentic-ai

How Far Apart Does a Model Think Its Tokens Are?

LessWrong · Jun 7, 2026, 8:20 PM · Also reported by 1 other source

Instead of using static position increments (+1) per token, Ro PE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what the model thinks the distance is between each position and how this varies per-layer.Example sentence with each character plotted based on per-layer learned position increments. Note the clear punctuation-based boundaries in L0 and what looks like concept-based grouping in L3.I think this might be useful as another technique to inspect "where the model is looking" in addition to plotting attention patterns (and with similar limitations). The patterns can also hint at what the model is looking for at each layer (when position increments match different kinds of boundaries).Note: This is still partially a solution in search of a problem. I'm hoping to help with the "searching under lamp posts" problem by finding more lamp posts, but there's additional work to be done here to see if this is actually useful or just a novelty.AI disclaimer: The Architecture, Learned Position Increments, and Related Work sections were originally drafted by Claude before being (heavily) human-edited.IntroductionStandard LLMs use Rotary Position Embeddings (RoPE) to encode the location of each position by rotating the key and query vectors by angles proportional to the number of tokens between the two positions.Standard RoPE assumes that each token advances the position counter by +1, but we can train a model to advance the position counter by a learned increment per-token. Going further, we can learn a per-layer position increment vector, allowing us to calculate content-based position increments at any layer of the model.MethodArchitectureThe models are small decoder-only transformers — 256-dimensional, 8 heads, 6 layers, ~6.4M parameters, with RMSNorm, SwiGLU MLPs, and RoPE (θ = 10,000) — directly on raw UTF-8 bytes rather than BPE tokens. The vocabulary is 257 symbols: 25

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories

Also covered by

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop