Originally Posted by ilya239
Ah, if each vector is normalized to unit length then this makes sense. But, there is no way to constrain the vector component values during gradient descent so that the vectors stay at unit length. Or is it that each vector is normalized every time we compute the dot product? I understand wanting the vectors to point in the same direction, but the vector magnitude seems like a distraction.

The vectors are not normalized, at least not deliberately. The argument was only meant to motivate why the inner product measures how well two vectors match. Even if we regard the magnitude as a distraction, nothing forces the learning algorithm to change it: gradient descent is free to keep the magnitudes essentially fixed whenever that helps reduce the error value.
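To make the point concrete, here is a minimal sketch, not the exact setup from the thread. It assumes a squared-error loss on a single target value, and the names `sgd_step`, `target`, and `lr` are hypothetical, chosen only for illustration. It checks that the inner product factors into a direction term (the cosine) and the two magnitudes, then takes one unconstrained gradient step to show that the magnitudes move only as far as the error pushes them.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Cosine of the angle between u and v (the 'matching' part)."""
    return dot(u, v) / (norm(u) * norm(v))

def sgd_step(u, v, target, lr):
    """One gradient step on the assumed loss (target - u.v)^2.

    No normalization is applied; both vectors are updated freely.
    """
    err = target - dot(u, v)
    new_u = [a + 2 * lr * err * b for a, b in zip(u, v)]
    new_v = [b + 2 * lr * err * a for a, b in zip(u, v)]
    return new_u, new_v

u = [3.0, 4.0]
v = [0.6, 0.8]  # same direction as u, unit length

# u.v = |u| * |v| * cos(angle): direction and magnitude both contribute.
assert math.isclose(dot(u, v), norm(u) * norm(v) * cosine(u, v))

# The target is larger than the current inner product, so the step
# lengthens both vectors; their direction is already aligned.
new_u, new_v = sgd_step(u, v, target=10.0, lr=0.01)
print("norm of u before/after:", norm(u), norm(new_u))
print("norm of v before/after:", norm(v), norm(new_v))

# If the target is already met, the error is zero and nothing changes.
same_u, same_v = sgd_step(u, v, target=dot(u, v), lr=0.01)
print("unchanged when error is zero:", same_u == u and same_v == v)
```

In this sketch the magnitudes change only in the first step, where doing so reduces the error. In the second step the error is already zero, so the update leaves both vectors, and hence their lengths, untouched.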