2026-06-13 at

Why do we use QKV and not fewer parameters?

Zoom out for a second. Q, K, and V are all just parameters. An attention block in any transformer layer can have an arbitrary number of parameters, as long as they are wired up in information-preserving ways.

Because the overall transformer is trained iteratively, the system will store information in whatever parameters are available, without prejudice. If you give it one parameter, it will use it; if you give it ten, ditto.

The network architecture is simply a decision about how much memory to allocate to the internal nodes of a multi-node NN.
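
To make the "just parameters" point concrete, here is a minimal sketch in plain Python. It assumes a single attention head with no biases and looks only at the pre-softmax scores; the sizes and names (`d_model`, `d_head`, `W_q`, `W_k`, `W_qk`) are made up for illustration. It shows that the separate query and key matrices can be folded into one merged matrix that gives the same scores, so the Q/K split is one way of spending a parameter budget rather than something the math forces.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
d_model, d_head, n_tokens = 8, 2, 3

X = rand_matrix(n_tokens, d_model, rng)    # token embeddings
W_q = rand_matrix(d_model, d_head, rng)    # query projection
W_k = rand_matrix(d_model, d_head, rng)    # key projection

# Usual form: project to queries and keys, then take dot products.
Q = matmul(X, W_q)
K = matmul(X, W_k)
scores_qk = matmul(Q, transpose(K))

# Merged form: one d_model x d_model matrix replaces the pair.
W_qk = matmul(W_q, transpose(W_k))
scores_merged = matmul(matmul(X, W_qk), transpose(X))

# The two forms agree up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(scores_qk, scores_merged)
    for a, b in zip(row_a, row_b)
)

# Same scores, different parameter budgets.
params_factored = 2 * d_model * d_head   # W_q and W_k
params_merged = d_model * d_model        # W_qk

print(max_diff < 1e-9, params_factored, params_merged)
```

With these sizes the factored pair uses 32 parameters and the merged matrix uses 64, while the scores match. Which wiring you pick is a choice about how many parameters to allocate to that part of the network.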
