Neural Network predictions converging to one value
P.S.: Crossposting my post with minor edits from one of the LFD course subforums. But this probably belongs here. And all are welcome to comment.
Professor Abu-Mostafa: Sorry about the length of this post, but I would appreciate your advice on a problem I am facing with a neural network that I am trying to implement for regression. The problem is that I am finding the predicted values eventually turn out to be the same for all inputs. It seemed to me, after doing some reading, that this is possibly a consequence of the saturation of the hidden layer.
This is a network with one hidden layer and one linear output neuron. The hidden layer is nonlinear, and I have tried various sigmoid functions here. I have tried tanh and logistic functions, and then after reading some papers on how these can result in saturation, I also tried rectified linear units, i.e. max(0, x). However, after some amount of time that varies with the parameters, the output values are again all equal even with the rectified linear units.
I am using gradient descent. I have tried minibatches with several thousand examples, iterating over them to decrease the cost in each batch. I have also tried learning from just one example at a time. I have tried randomly permuting the inputs. I started with the cross-entropy error and am now working with mean square error. I have checked the gradient calculation by numerically checking some values by perturbing the weights slightly. I have also tried it with and without regularization.
With tanh, the learning seemed to be quick even with one example at a time, but ran into this stuck behavior very early. With rectified linear units, it is learning more slowly but then it seems to be on a big plateau, and it took some time to get into this saturated or saturated-like state. I think I am training on a sufficient number of examples. Their number is about 10 times the number of weights, as per my understanding of the VC analysis in your lectures.
I noticed in earlier runs that the predicted values tended to converge to the mean of the target output values of the last minibatch that it had been trained on. It seems to me that it somehow wants to minimize the cost by finding the mean of the target outputs, and then use this mean for prediction. And that is the local minimum that it seems to move to. However, this does not happen right away, so I don't think I have accidentally coded anything into the cost function or the backpropagation specifically asking it to do this.
Does the math of backpropagation encourage this specific kind of local minimum (predicted value tending to mean of outputs from minibatch)? While it is certainly possible there is a bug in my code, is this kind of behavior common? If so, what measures would you recommend to address it? Specifically, if I iterate over the same examples for a much longer duration, can the neural network move out of this state? In fact, as this is happening even with rectified linear units, is this not theoretically a phenomenon of saturation to fixed hidden layer activations, but some other behavior, related to an overall tendency of the outputs towards certain local minima? Or are they really the same thing? Is it possible to not get into such a situation by trying out many random combinations of initial weight values?
It seems to me that this is not really a generalization problem, and that regularization may not cure this, though it may find a different minimum where it saturates, by changing the cost. Is this intuition correct? Thanks again for a wonderful course.
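For readers who want to reproduce the numerical gradient check mentioned in this post, here is a minimal sketch in Python with NumPy. It is not the poster's code: the network sizes, the helper names (`forward`, `loss`, `backprop`, `numeric_grad`) and the random data are all assumptions made for illustration. It uses a tanh hidden layer, one linear output neuron and a half mean squared error cost, and compares backpropagated gradients against central differences.

```python
import numpy as np

def forward(params, X):
    """One tanh hidden layer followed by a single linear output neuron."""
    W1, b1, w2, b2 = params
    H = np.tanh(X @ W1 + b1)          # hidden activations, shape (n, h)
    return H, H @ w2 + b2             # predictions, shape (n,)

def loss(params, X, y):
    """Half mean squared error over the batch."""
    _, pred = forward(params, X)
    return 0.5 * np.mean((pred - y) ** 2)

def backprop(params, X, y):
    """Analytic gradients of `loss` with respect to every parameter."""
    W1, b1, w2, b2 = params
    H, pred = forward(params, X)
    d_out = (pred - y) / len(y)                    # dL/dpred, shape (n,)
    d_hid = np.outer(d_out, w2) * (1.0 - H ** 2)   # back through tanh, shape (n, h)
    return [X.T @ d_hid, d_hid.sum(axis=0), H.T @ d_out, np.array([d_out.sum()])]

def numeric_grad(params, X, y, eps=1e-6):
    """Central differences, perturbing one weight at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = loss(params, X, y)
            p[idx] = old - eps
            down = loss(params, X, y)
            p[idx] = old                           # restore the weight
            g[idx] = (up - down) / (2 * eps)
        grads.append(g)
    return grads

# Hypothetical toy problem: 20 examples, 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
params = [rng.normal(size=(3, 4)), rng.normal(size=4),
          rng.normal(size=4), rng.normal(size=1)]

analytic = backprop(params, X, y)
numeric = numeric_grad(params, X, y)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print("largest analytic-vs-numeric difference:", max_diff)
```

A difference many orders of magnitude below the gradient values themselves suggests the backpropagation code matches the cost function, which would point away from a gradient bug as the cause of the constant predictions.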
Re: Neural Network predictions converging to one value
Quote:

Re: Neural Network predictions converging to one value
Professor,
The training data points do have different target outputs. After training from a minibatch of the training set and adjusting the weights, when I try to predict the outputs of the same training minibatch, it produces identical (incorrect) outputs once it has undergone some training. Initially, i.e. at the beginning of training, the predictions are far off and mostly differ from one another, though many are already the same. After some training, all the predictions on the training set start to converge, and they get closer to the mean of the outputs of the latest training minibatch. I get the same output value if I then run the network on a validation set. When I continue training from the next minibatch and then test with that minibatch and then a validation set, the behavior is the same, but the predicted value changes after each training minibatch. I hope I have answered your question.
Initially, I had conjectured that maybe the network is completely biased and is just providing a fixed output from the weight of the bias term, with no contribution from the other weights, and so I removed the bias term altogether. But the behavior stayed the same, and I brought the bias term back.
Re: Neural Network predictions converging to one value
Quote:

Re: Neural Network predictions converging to one value
Sure, I just tried the training that you suggested. With repeated training over the same 2 examples that have different target outputs, it learns correctly, and eventually predicts them both with no training error.

Re: Neural Network predictions converging to one value
Quote:
1. Computational: Not enough epochs, or a bad local minimum.
2. Inherent: The target function is almost impossible to capture given the size of the network.
Let's deal with 1 first. Try a very long run with, say, 100 times the epochs and see if the result is better. Also try 100 different runs (initial random weights seeded differently) with the smaller number of epochs and see what the best result is.
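The multiple-restart experiment suggested above can be sketched roughly as follows. This is an illustration under stated assumptions, not the setup used in the thread: the `train` helper, the toy target sin(x), the network size, the learning rate and the number of seeds are all hypothetical choices. The point is only the pattern of running the same training from differently seeded initial weights and comparing the best run against the error of always predicting the mean.

```python
import numpy as np

def train(seed, X, y, hidden=8, epochs=3000, lr=0.02):
    """Full-batch gradient descent on a tanh network; returns final training MSE."""
    rng = np.random.default_rng(seed)            # seed controls the initial weights
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                 # hidden activations
        err = H @ w2 + b2 - y                    # prediction error per example
        d_hid = np.outer(err, w2) * (1.0 - H ** 2)
        W1 -= lr * X.T @ d_hid / n
        b1 -= lr * d_hid.mean(axis=0)
        w2 -= lr * H.T @ err / n
        b2 -= lr * err.mean()
    pred = np.tanh(X @ W1 + b1) @ w2 + b2
    return float(np.mean((pred - y) ** 2))

# Hypothetical toy regression problem.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X[:, 0])

results = {seed: train(seed, X, y) for seed in range(5)}
best_seed = min(results, key=results.get)
baseline = float(np.var(y))                      # MSE of always predicting the mean

print("MSE per seed:", results)
print("best seed:", best_seed, "baseline (predict the mean):", baseline)
```

If even the best of many seeds stays near the baseline, the computational explanation (an unlucky local minimum) becomes less likely and the inherent one deserves a closer look.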
Re: Neural Network predictions converging to one value
Thanks a lot for your suggestions. I will try them out, and let you know the results.

Re: Neural Network predictions converging to one value
Professor,
I have been running the training as you suggested. It's still in progress, but I thought I would mention some things I have found in the meantime. Without regularization, it has actually not got stuck (in the identical predictions problem on the training set) after 24 epochs. This is contrary to what I had reported in my first post, about regularization not affecting this problem. I must have got my observations mixed up while juggling the different model hyperparameters. Sorry about that.
However, with some regularization, I am seeing the problem I saw before. So, it must have been the regularization that pushed it over to a high-bias region, where the best it could do was to learn the mean of the outputs it most recently saw and predict that for every example. In that case, maybe this is similar to the condition shown in the last curve on slide 12 of your lecture on regularization?
I still notice significant training error in the long many-epoch run, and even in the randomized runs, though it probably hasn't gone through enough iterations of random runs yet to be sure it's always the case. But if this trend continues, I suppose that means the model may not have a sufficient number of parameters for this problem?
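The effect described here, where regularization pushes a model toward predicting the mean, can be illustrated on a much simpler model than the network in this thread. The sketch below is an assumption-laden stand-in: it uses linear ridge regression with an L2 penalty on the weights only (the intercept is left unpenalized), synthetic data, and a hypothetical helper `ridge_fit`. It does not claim to reproduce the poster's network, only the general mechanism.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering handles the intercept
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Hypothetical synthetic data with a clear linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

spread = {}      # standard deviation of the predictions for each penalty
mean_gap = {}    # |mean prediction - mean target| for each penalty
for lam in [0.0, 1.0, 1e2, 1e4, 1e6]:
    w, b = ridge_fit(X, y, lam)
    pred = X @ w + b
    spread[lam] = float(pred.std())
    mean_gap[lam] = float(abs(pred.mean() - y.mean()))
    print(f"lambda={lam:g}  prediction spread={spread[lam]:.6f}")
```

As the penalty grows, the weights shrink toward zero and every prediction approaches the mean of the training targets, which is the same symptom as the identical predictions reported above. Whether the penalty in the actual network was strong enough for this to be the whole explanation cannot be determined from the thread.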
Re: Neural Network predictions converging to one value
One possibility which I cannot exclude from your description is that the nonrandom part of the relationship between your inputs and outputs is insignificant compared to the random component. In this case, you may arrive at all inputs predicting an average of the outputs. This is the best that can be done if the outputs are entirely random, so there is no learnable content.
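The claim that the average is the best possible prediction for purely random outputs follows from a short calculation, sketched here with notation that is an assumption of this note ($y_n$ for the targets, $c$ for a constant prediction, $h$ for any hypothesis, $\mu = \mathbb{E}[y]$):

```latex
% Best constant under mean squared error on N targets y_1, ..., y_N:
E(c) = \frac{1}{N}\sum_{n=1}^{N} (y_n - c)^2,
\qquad
\frac{dE}{dc} = -\frac{2}{N}\sum_{n=1}^{N} (y_n - c) = 0
\;\Longrightarrow\;
c^{*} = \frac{1}{N}\sum_{n=1}^{N} y_n = \bar{y}.

% If y is independent of x, then for any hypothesis h the cross term vanishes:
\mathbb{E}\!\left[(y - h(x))^2\right]
= \operatorname{Var}(y) + \mathbb{E}\!\left[(h(x) - \mu)^2\right]
\;\ge\; \operatorname{Var}(y),
\quad\text{with equality when } h(x) \equiv \mu .
```

So when the inputs carry no information about the outputs, no hypothesis can beat the constant $\mu$ in expected squared error, and gradient descent on a squared error cost is being pulled toward that constant.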

Re: Neural Network predictions converging to one value
Thanks for your comment. If I understand your post correctly, you're saying that maybe the inputs are not far from random? Well, I'm hoping that's not the case, but it's certainly possible that my representation of the actual input has some issues, as I was trying out what, to my knowledge, may be a nonstandard way to represent this input.
And just to complete the picture about the training runs ... I did run it for 100 epochs with the same initial values, and I completed 50 runs with different random initial settings. I had to stop it midway through the 100 random runs, as it was beginning to sort of take over my computer. :) I ran these with no regularization, and none of them showed the “identical predictions” problem, though they do not show good learning behavior. But with regularization, I do see the problem. 
Re: Neural Network predictions converging to one value
I have a couple of points, based on not dissimilar experiences of my own.
First, are you concentrating on the calculated errors on your out-of-sample data as you train the neural network? In-sample errors are not easy to draw conclusions from (unless your data set is very large compared to the complexity of the neural network). I am not sure what software you are using, but in JNNS for example, you can see a graph of OOS errors as you are training.
Secondly, as a simple test, you could try repeating the training with the input data replaced by entirely random data (but keeping the same output data) to see the comparison.
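The random-input comparison suggested above can be sketched as a small control experiment. Everything specific here is a hypothetical stand-in: the data are synthetic, and a cubic polynomial least-squares fit (the `val_mse` helper) replaces the neural network so the example stays short. The structure of the test is what matters: same targets, real inputs versus random inputs, compared on out-of-sample error.

```python
import numpy as np

def val_mse(X_train, y_train, X_val, y_val):
    """Fit a stand-in model (cubic polynomial least squares); return validation MSE."""
    def feats(X):
        return np.hstack([np.ones((len(X), 1)), X, X ** 2, X ** 3])
    coef, *_ = np.linalg.lstsq(feats(X_train), y_train, rcond=None)
    return float(np.mean((feats(X_val) @ coef - y_val) ** 2))

# Hypothetical data: the real inputs carry signal, the random inputs do not.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)
X_random = rng.uniform(-3, 3, size=X.shape)      # same shape, independent of y

tr, va = slice(0, 300), slice(300, 400)          # simple train/validation split
mse_real = val_mse(X[tr], y[tr], X[va], y[va])
mse_random = val_mse(X_random[tr], y[tr], X_random[va], y[va])
mse_mean = float(np.mean((y[va] - y[tr].mean()) ** 2))   # always predict the mean

print("real inputs:  ", mse_real)
print("random inputs:", mse_random)
print("predict mean: ", mse_mean)
```

If the out-of-sample error with the real inputs is no better than with random inputs, and both sit near the predict-the-mean figure, that would support the view that the chosen input representation carries little learnable signal.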
Re: Neural Network predictions converging to one value
Once it gets into that situation, the results are the same both in-sample and out-of-sample. It sounds like an interesting idea to try the training with random data, to see if there is any issue with the input. Thanks.

Re: Neural Network predictions converging to one value
Quote:
This relationship P(y | x) comprises a deterministic part plus noise, which arises in different ways but has the same effect on the statistics. In my experience, if you provide inputs that are explicitly independent of the outputs (so the outputs are independent of the inputs and P(y | x) is entirely random noise), a neural network will generally converge to a constant function whose value is the average of the outputs. The reason is that this function gives the absolute minimum RMSE. If a neural network converges to anything else in this case, it must be fitting the noise. This is unlikely to happen unless there is a small number of input points compared to the complexity of the neural network.
I should make clear that my understanding of the above is empirical with a core of simple probability theory. The detailed behaviour of neural networks is very obscure, and I am glossing over issues such as local minima, merely because I haven't seen this confusing the issue and suspect it generally won't.
As for the technical details, it is useful to monitor the RMSE errors on out-of-sample data as a neural network is being trained, because this helps distinguish between the useful effect of training (generalisation) and the bad effect of training (overfitting). This applies whether there is a deterministic relationship between inputs and outputs, a noisy relationship, or even when they are totally independent (in this case, a network can first model the average, but then may learn the random noise in a way which increases out-of-sample RMSE).
Can you describe the nature of your data? Is it financial time series data, perhaps?
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.