Tree Transformers: A step towards generalizing the transformer architecture

LessWrong · Jun 24, 2026, 2:50 AM · Also reported by 1 other source

After a billion architectures and a trillion variations, I finally found a transformer architecture that intrigued me. This essay is step one towards the theory and the sub-field on which it was built [1].

I.

The representation of any dataset whose distribution we wish to approximate, predict, or learn has converged on vectors. This phenomenon was no accident: vectors provide higher-dimensional representations of the data, which are needed for mapping connections between data points, and they play right into the strengths of a GPU, giving us incredibly parallelizable algorithms for deep learning. While these points are true and have borne incredible fruit, we can always benefit from probing further.

Every dataset has a unique internal structure [2] that our models must approximate in order to learn and predict. In the process of transforming the data into a vector representation, this structure is lost (or is very costly to preserve) [3].

What happens if, instead of transforming the dataset into a vector representation, we try to preserve what structure we have and generalize the transformer to operate on that structure itself, instead of on vectors?

II.

As the title suggests, the first data structure I chose to explore was the ubiquitous tree [4]. Trees appear more often than any other structure in computer science (except maybe graphs), and hence I chose to implement the self-attention mechanism for trees before anything else.

We first begin by defining a Node data structure, like so, but with a crucial difference: its stored value can be another node, recursively. This structure itself becomes the essence of the deep network structures we care about: just as a matrix is a vector of vectors, so can this structure give us a “tree of trees”. This is essential when defining the very important outer-product or matrix-multiplication equivalent for the tree data structure.

The next step would be to define three very importan
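To make the recursive Node idea from section II concrete, here is a minimal Python sketch. It is not the author's implementation: the field names `value` and `children`, and the helpers `nesting_depth` and `count_nodes`, are assumptions introduced purely for illustration. The only property taken from the essay is that a node's stored value may itself be a node, which is what allows a "tree of trees" by analogy with a matrix being a vector of vectors.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class Node:
    # Hypothetical layout: the payload is either a plain number
    # or, recursively, another Node (giving a "tree of trees").
    value: Union[float, "Node"]
    children: list["Node"] = field(default_factory=list)


def nesting_depth(node: Node) -> int:
    """Count how many Node layers are nested inside `value` (0 for a plain number)."""
    depth = 0
    current = node.value
    while isinstance(current, Node):
        depth += 1
        current = current.value
    return depth


def count_nodes(node: Node) -> int:
    """Count the nodes of the outer tree only, following `children` and not `value`."""
    return 1 + sum(count_nodes(child) for child in node.children)


# An ordinary tree whose values are plain numbers.
inner = Node(1.0, [Node(2.0), Node(3.0)])

# A tree whose root and one leaf store a whole tree as their value.
outer = Node(inner, [Node(4.0), Node(inner)])
```

Here `inner` plays the role of a vector and `outer` the role of a matrix: `nesting_depth(outer)` is 1 because its value is itself a tree, while `count_nodes(outer)` walks only the outer structure. Keeping `value` and `children` as separate fields is one possible design choice; the essay's own definition may differ.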

Article preview — originally published by LessWrong. Full story at the source.
