Ah, if each vector is normalized to unit length then this makes sense. But there is no way to constrain the vector component values during gradient descent so that the vectors stay at unit length. Or is each vector normalized every time we compute the dot product? I understand wanting the vectors to point in the same direction, but the vector magnitude seems like a distraction.
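(For what it's worth, one common way to resolve this is to leave the parameters unconstrained during training and normalize only at comparison time, i.e. compute cosine similarity instead of a raw dot product. A minimal sketch of that idea, my own illustration rather than any particular library's implementation:)

```python
import math

def cosine_similarity(u, v):
    # Normalize at comparison time: dividing the dot product by the two
    # magnitudes makes the (unconstrained) vector lengths drop out entirely.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two vectors with the same direction but different magnitudes:
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # u scaled by 2
print(cosine_similarity(u, v))  # ~1.0: direction identical, magnitude ignored
```

(Some setups do instead project each vector back onto the unit sphere after every gradient step, but normalizing inside the similarity computation means the magnitudes the optimizer happens to produce simply never matter.)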

I know that model parameters needn't have a human-understandable interpretation (cf. hidden layers of neural networks), but

*if* they do, it helps to see that the intuition makes sense