Neural Network predictions converging to one value
P.S.: Crossposting my post with minor edits from one of the LFD course subforums. But this probably belongs here. And all are welcome to comment.
Professor AbuMostafa:
Sorry about the length of this post, but I would appreciate your advice on a problem I am facing with a neural network that I am trying to implement for regression.
The problem is that I am finding the predicted values eventually turn out to be the same for all inputs. It seemed to me, after doing some reading, that this is possibly a consequence of the saturation of the hidden layer. This is a network with one hidden layer and one linear output neuron. The hidden layer is nonlinear, and I have tried various sigmoid functions here. I have tried tanh and logistic functions, and then after reading some papers on how these can result in saturation, I also tried rectified linear units i.e. max(0, x). However, after some amount of time that varies with the parameters, the output values are again all equal even with the rectified linear units. I am using gradient descent. I have tried minibatches with several thousand examples, iterating over them to decrease the cost in each batch. I have also tried learning from just one example at a time. I have tried randomly permuting the inputs. I started with the crossentropy error and am now working with mean square error. I have checked the gradient calculation by numerically checking some values by perturbing the weights slightly. I have also tried it with and without regularization. With tanh, the learning seemed to be quick even with one example at a time, but ran into this stuck behavior very early. With rectified linear units, it is learning more slowly but then it seems to be on a big plateau, and it took some time to get into this saturated or saturatedlike state.
I think I am training on a sufficient number of examples. Their number is about 10 times the number of weights, as per my understanding of the VC analysis in your lectures.
I noticed in earlier runs that the predicted values tended to converge to the mean of the target output values of the last minibatch that it had been trained on. It seems to me that it somehow wants to minimize the cost by finding the mean of the target outputs, and then use this mean for prediction. And that is the local minimum that it seems to move to. However, this does not happen right away, so I don't think I have accidentally coded anything into the cost function or the backpropagation specifically asking it to do this. Does the math of backpropagation encourage this specific kind of local minimum (predicted value tending to mean of outputs from minibatch)?
While it is certainly possible there is a bug in my code, is this kind of behavior common? If so, what measures would you recommend to address it? Specifically, if I iterate over the same examples for a much longer duration, can the neural network move out of this state?
In fact, as this is happening even with rectified linear units, is this not theoretically a phenomenon of saturation to fixed hidden layer activations, but some other behavior, related to an overall tendency of the outputs towards certain local minima? Or are they really the same thing?
Is it possible to not get into such a situation by trying out many random combinations of initial weight values?
It seems to me that this is not really a generalization problem, and that regularization may not cure this, though it may find a different minimum where it saturates, by changing the cost. Is this intuition correct?
Thanks again for a wonderful course.
