Quote:
Originally Posted by marek
I must have missed something, but I do not understand why we permute the data.
treats each data point separately, but then sums them all up. Thus, even if we do permute the data points, in the end it all gets combined together in this sum. What am I overlooking?

True. If we were using batch mode, permuting the data would not change anything, since the weight update is done once at the end of the epoch and takes all the examples into account regardless of the order in which they were presented. In stochastic gradient descent, however, the weights are updated after each example, so the order changes the outcome. The permutations randomize that order, so we get the benefits of randomness that were mentioned briefly in Lecture 9.
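To make the difference concrete, here is a minimal sketch in Python. It assumes a one-weight linear model with squared error; the toy data, the learning rate, and the function names (grad, batch_epoch, sgd_epoch) are made up for illustration and are not from the lecture.

```python
def grad(w, x, y):
    # Gradient of the squared error (w*x - y)**2 with respect to w.
    return 2.0 * (w * x - y) * x

def batch_epoch(w, data, lr):
    # Batch mode: average the gradient over all examples, then update once.
    g = sum(grad(w, x, y) for x, y in data) / len(data)
    return w - lr * g

def sgd_epoch(w, data, lr):
    # SGD: update after every example, in the order the examples are given.
    for x, y in data:
        w = w - lr * grad(w, x, y)
    return w

# Hypothetical toy data set of (x, y) pairs and the same points in another order.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
reordered = [data[2], data[0], data[1]]
lr = 0.1  # assumed learning rate

batch_a = batch_epoch(0.0, data, lr)
batch_b = batch_epoch(0.0, reordered, lr)
sgd_a = sgd_epoch(0.0, data, lr)
sgd_b = sgd_epoch(0.0, reordered, lr)

print("batch:", batch_a, batch_b)  # the two values agree
print("sgd:  ", sgd_a, sgd_b)      # the two values differ
```

In the batch case every gradient is evaluated at the same starting weight, so the sum does not care about the order. In the SGD case each gradient is evaluated at a weight that already depends on the earlier examples, which is why the final weight depends on the order. In practice one would reshuffle the data before each epoch, for example with random.shuffle.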