2026-05-23 at

current frontier LLM internals ( broadly )

TIL current frontier "transformer" LLMs ... 

1. 

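The terms "encoder" and "decoder" apparently don't apply to the encoding of end-user queries. "Encoder-only" and "encoder-decoder" LLMs thus refer to entirely other things : which kind of transformer blocks the model is built from ( encoder blocks let every token see the whole input, decoder blocks let each token see only the tokens before it ).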

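Frontier LLMs are "decoder-only" text generators, but they DO still convert the user's query into "token" space : the text is split into "tokens" ( integer IDs for words or word-pieces ), and each token is then mapped to a vector.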
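A minimal sketch of that first step, assuming a made-up word-level vocabulary ( `TOY_VOCAB` and `toy_tokenize` are hypothetical names; real tokenizers use learned subword vocabularies with tens of thousands of entries ) :

```python
# Toy word-level tokenizer. The vocabulary below is invented for
# illustration; it is not any real model's vocabulary.
TOY_VOCAB = {"<unk>": 0, "why": 1, "is": 2, "the": 3, "sky": 4, "blue": 5, "?": 6}

def toy_tokenize(text):
    """Map text to a list of integer token IDs using TOY_VOCAB."""
    # Lower-case, split the question mark off as its own token, then
    # split on whitespace.
    words = text.lower().replace("?", " ?").split()
    # Unknown words fall back to the <unk> ID.
    return [TOY_VOCAB.get(w, TOY_VOCAB["<unk>"]) for w in words]

print(toy_tokenize("Why is the sky blue?"))  # [1, 2, 3, 4, 5, 6]
```

Everything after this point in the model only ever sees the integer IDs ( and the vectors looked up from them ), never the original characters.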

Simplifying the dumb parts : the following model process applies to any data, sensory or verbal - all those inputs are first translated into "tokens", and fed in the same way. After processing, all results are spat out into the target ( sensory or verbal ) human language.

2.

Frontier "decoder-only" models function "autoregressively" : each new token is predicted from everything that came before it, including the model's own earlier output. The important part of the model's thought process is broadly - grab a PENCIL and SKETCH it out - 2A looks something like this :

Input --> [ 2A1 --> 2A2 ] x ( 80 to 120, minus skipped ) layers --> Output
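The same sketch as a code skeleton. This is only the outer loop : `attention_stub` and `ffn_stub` are hypothetical placeholders rather than real attention or feed-forward maths, and normalisation steps are left out. The one detail added beyond the diagram is the "residual" add, where each block's output is added to its input instead of replacing it.

```python
# Skeleton of the layer stack: each layer applies attention (2A1), then a
# feed-forward block (2A2). Both sublayers are stand-in functions here.
def attention_stub(x):
    return [0.1 * v for v in x]   # placeholder for 2A1

def ffn_stub(x):
    return [0.1 * v for v in x]   # placeholder for 2A2

def run_stack(x, n_layers=80):
    """Pass vector x through n_layers of [attention -> FFN]."""
    for _ in range(n_layers):
        a = attention_stub(x)
        x = [xi + ai for xi, ai in zip(x, a)]   # residual add after 2A1
        f = ffn_stub(x)
        x = [xi + fi for xi, fi in zip(x, f)]   # residual add after 2A2
    return x

out = run_stack([1.0, -2.0], n_layers=2)
```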

2A.

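Input data activates a process of roughly 80 to 120 sequential "layers" in the largest models. ( Fortunately at this point in history, BRANCHING LOGIC such as early exit or mixture-of-depths routing allows some to be skipped, in models that use it. ) Each layer is internally composed of,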

2A1.

... an initial "attention" network, of PARALLELISED weight-holding memory cells, split into 64 to 128 "attention heads" that run side by side ( architectural variations : MHA, GQA, MQA, MLA, etc. ), whose purpose is to CLUSTER the input data, i.e. to let each token pull in information from the earlier tokens most relevant to it, followed by

2A2.

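... a subsequent "feed forward network" block, of weight-holding memory cells ( typically two or three weight matrices deep ), applied to each token separately, whose purpose is to LOGICALLY JUDGE the input data. In "mixture of experts" ( MoE ) models this block is where the BRANCHING happens : a router sends each token to only a few of many parallel expert networks.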
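To make 2A1 and 2A2 concrete, here is a minimal sketch of one attention head and one feed-forward block in plain Python. It assumes a single head, tiny 2-dimensional vectors, identity weight matrices and a ReLU activation, all chosen for readability; real models use many heads, learned weights, and often gated activations instead.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [dot(row, v) for row in W]

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(xs, Wq, Wk, Wv):
    """2A1: single-head causal self-attention over a list of token vectors."""
    qs = [matvec(Wq, x) for x in xs]     # queries
    ks = [matvec(Wk, x) for x in xs]     # keys
    vs = [matvec(Wv, x) for x in xs]     # values
    d = len(qs[0])
    out = []
    for i, q in enumerate(qs):
        # Token i may only look at tokens 0..i (the causal mask).
        weights = softmax([dot(q, ks[j]) / math.sqrt(d) for j in range(i + 1)])
        # Output is the attention-weighted average of the visible values.
        out.append([sum(w * vs[j][c] for j, w in enumerate(weights))
                    for c in range(len(vs[0]))])
    return out

def feed_forward(x, W1, W2):
    """2A2: per-token feed-forward block (project, ReLU, project back)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

I2 = [[1.0, 0.0], [0.0, 1.0]]            # identity weights, illustration only
xs = [[1.0, 0.0], [0.0, 1.0]]            # two made-up token vectors
mixed = causal_attention(xs, I2, I2, I2)
judged = [feed_forward(m, I2, I2) for m in mixed]
```

The first token can only attend to itself, so `mixed[0]` equals its own value vector; the second token's output is a blend of both tokens, weighted towards itself.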

3.

During the model's "training" period, each loop "feeds forward" data through all ( 80 to 120 minus skipped ) layers of 2A in sequence, and from 2A1 to 2A2 within every layer, then checks the correctness of the result ( how well it predicted the actual next token ), then "backpropagates" corrections to the weights of 2A1 and 2A2 in all those layers - which make future "feed forwards" more correct. Many cycles happen, to maximise correctness.
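The feed forward / check / backpropagate cycle can be sketched on a deliberately tiny "network" with just two weights. Everything here is an assumption made for illustration : the model `y = w2 * relu(w1 * x)`, the squared-error check ( LLM training uses a next-token prediction loss instead ), the starting weights, and the learning rate.

```python
def train(steps=200, lr=0.05):
    """Train y = w2 * relu(w1 * x) to output `target` for input `x`."""
    w1, w2 = 0.5, 0.5          # made-up starting weights
    x, target = 1.0, 2.0       # one made-up training example
    losses = []
    for _ in range(steps):
        # 1. Feed forward.
        h_pre = w1 * x
        h = max(0.0, h_pre)
        y = w2 * h
        # 2. Check correctness.
        loss = (y - target) ** 2
        losses.append(loss)
        # 3. Backpropagate: chain rule from the loss back to each weight.
        dy = 2.0 * (y - target)
        dw2 = dy * h
        dh = dy * w2
        dw1 = dh * (1.0 if h_pre > 0 else 0.0) * x
        # 4. Nudge the weights against the gradient.
        w1 -= lr * dw1
        w2 -= lr * dw2
    return w1, w2, losses

w1, w2, losses = train()
print(losses[0], losses[-1])   # the loss shrinks over the cycles
```

A frontier model does the same four steps, only with billions of weights and the gradients flowing back through every layer.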

4.

During the model's "inference" period, a "feed forward" loop through all ( 80 to 120 minus skipped ) layers of 2A happens ONCE FOR EVERY SINGLE WORD ( strictly, every token ) GENERATED. Like this ...

[ Query ] -> Loop1 -> adds word1

[ Query + word1 ] -> Loop2 -> adds word2

[ Query + word1 + ... + word(N-1) ] -> LoopN -> adds wordN
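The loop above, as a minimal sketch. `toy_forward` is a hypothetical stand-in for one full pass through all the layers : it just looks up the last token in a made-up table, where a real model would run the whole stack over the sequence.

```python
# Made-up "model": maps the last token to the next one.
TOY_NEXT = {"the": "sky", "sky": "is", "is": "blue", "blue": "<end>"}

def toy_forward(tokens):
    """Stand-in for one feed-forward loop through every layer."""
    return TOY_NEXT.get(tokens[-1], "<end>")

def generate(query_tokens, max_new_tokens=10):
    tokens = list(query_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_forward(tokens)   # one loop through the model per token
        if nxt == "<end>":
            break
        tokens.append(nxt)          # the output is fed back in as input
    return tokens

print(generate(["the"]))  # ['the', 'sky', 'is', 'blue']
```

Real implementations usually cache each earlier token's attention keys and values, so those tokens are not fully recomputed on every loop, but the one-pass-per-token loop itself remains.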

Can you see how incredibly stupid this is?

It would ( should? ) be much simpler to have a model that actually understands the query in terms of a sensory-spatiotemporal model, generates an answer in the same space ( which it then reads just once ), and then outputs just once into the target language. Maybe this is something JEPA and future models will fix.

5.

Because 2A1 is just about clustering the data and not judging it, all 64 to 128 attention heads and their network cells are hit during any "feed forward" loop, unless a layer is skipped.

6.

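But because 2A2 allows branching, over 90% of its cells can be skipped for each token in some mixture-of-experts ( MoE ) models. The same per-token skipping applies during "training" loops too, though across a whole training batch every expert still gets used and updated.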
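A minimal sketch of that branching, assuming a mixture-of-experts layout where a router scores every expert and only the top k run for a given token. The router scores, the 64-expert count and k = 2 are made-up numbers, not any particular model's configuration.

```python
def route_top_k(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

def skipped_fraction(n_experts, k):
    """Fraction of expert networks NOT run for a single token."""
    return 1.0 - k / n_experts

scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.4, -1.2, 0.9]   # one token, 8 experts
print(route_top_k(scores, k=2))    # [1, 3]
print(skipped_fraction(64, 2))     # 0.96875
```

Different tokens get routed to different experts, which is why every expert still sees work over a long enough stretch of text.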
