Quote:
Originally Posted by rk1000
Thanks for your comment. If I understand your post correctly, you're saying that maybe the inputs are not far from random? Well, I'm hoping that's not the case, but it's certainly possible that my representation of the actual input has some issues, as I was trying out what, to my knowledge, may be a nonstandard way to represent this input.
And just to complete the picture about the training runs ... I did run it for 100 epochs with the same initial values, and I completed 50 runs with different random initial settings. I had to stop it midway through the 100 random runs, as it was beginning to sort of take over my computer. :) I ran these with no regularization, and none of them showed the “identical predictions” problem, though they do not show good learning behavior. But with regularization, I do see the problem.

The most important thing is the statistical relationship between inputs and outputs, P(y | x), not the distribution of inputs in isolation, P(x). In principle, reversible transformations of the inputs really just disguise this relationship. Of course, transformations may have an effect on the behaviour of a tool such as a neural network, which is why they are used.
This relationship P(y | x) comprises a deterministic part plus noise, which arises in different ways but has the same effect on the statistics.
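To make the "deterministic part plus noise" decomposition concrete, here is a minimal sketch (my own illustrative example, not the poster's data): generate y = f(x) + noise, then estimate the conditional mean E[y | x] by averaging y within narrow bins of x. The noise averages out and f(x) is recovered.

```python
import numpy as np

# Hypothetical illustration of P(y | x) = deterministic part + noise.
# Here the deterministic part is f(x) = x**2 and the noise is Gaussian.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100_000)
y = x ** 2 + 0.1 * rng.normal(size=x.size)   # y = f(x) + noise

# Estimate E[y | x] by averaging y within 20 narrow bins of x; this
# should recover f(x), because the zero-mean noise averages away.
bins = np.linspace(0, 1, 21)
idx = np.digitize(x, bins) - 1               # bin index 0..19 per point
cond_mean = np.array([y[idx == i].mean() for i in range(20)])
centers = (bins[:-1] + bins[1:]) / 2
print(np.max(np.abs(cond_mean - centers ** 2)))   # small deviation
```

The conditional mean is exactly the quantity an RMSE-trained network is trying to learn; everything left over is the noise part.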
In my experience, if you provide inputs that are statistically independent of the outputs (so P(y | x) = P(y), and the outputs are pure noise as far as the inputs are concerned), a neural network
will generally converge to a constant function whose value is the average of the outputs. The reason is that this constant gives the absolute minimum RMSE. If a neural network converges to anything else in this case, it must be fitting the noise, which is unlikely to happen unless the number of training points is small compared to the capacity of the network.
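The claim that the output average minimises RMSE is easy to check numerically. A minimal sketch (illustrative numbers of my own choosing): draw noise-only outputs and compare the RMSE of the constant-mean predictor against other constants.

```python
import numpy as np

# Hypothetical demo: when inputs carry no information about the outputs,
# the best constant predictor (by RMSE) is the mean of the outputs.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=10_000)  # outputs: pure noise

def rmse(pred):
    """RMSE of a constant prediction against the outputs."""
    return np.sqrt(np.mean((y - pred) ** 2))

mean_rmse = rmse(y.mean())
# Any other constant does strictly worse:
for c in (y.mean() - 0.5, y.mean() + 0.5, 0.0):
    assert rmse(c) > mean_rmse
print(round(mean_rmse, 3))   # close to the noise standard deviation
```

This is just the familiar bias-variance identity E[(y - c)^2] = Var(y) + (E[y] - c)^2, which is minimised at c = E[y]; so a network that collapses to the output mean is doing the statistically correct thing with uninformative inputs.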
I should make clear that my understanding of the above is empirical, with a core of simple probability theory. The detailed behaviour of neural networks is very obscure, and I am glossing over issues such as local minima only because I haven't seen them confuse the matter and suspect they generally won't.
As for the technical details, it is useful to monitor the RMSE on out-of-sample data as a neural network is being trained, because this helps distinguish the useful effect of training (generalisation) from the harmful one (overfitting). This applies whether the relationship between inputs and outputs is deterministic, noisy, or entirely absent (in that last case, a network may first model the average, but then learn the random noise in a way which increases the out-of-sample RMSE).
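The monitoring loop above can be sketched with a toy network (my own example, nothing to do with the poster's setup): a one-hidden-layer numpy network trained on a small noisy sample, with training and held-out RMSE recorded each epoch. If the held-out RMSE starts rising while the training RMSE keeps falling, the network has moved from generalising to fitting the noise.

```python
import numpy as np

# Minimal sketch: track train vs out-of-sample RMSE every epoch.
rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x) + 0.3 * rng.normal(size=(n, 1))  # signal + noise
    return x, y

x_tr, y_tr = make_data(40)    # small training set: easy to overfit
x_va, y_va = make_data(400)   # held-out data

h = 30                        # hidden units
W1 = rng.normal(scale=0.5, size=(1, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.5, size=(h, 1)); b2 = np.zeros(1)

def forward(x):
    a = np.tanh(x @ W1 + b1)
    return a, a @ W2 + b2

def rmse(x, y):
    return float(np.sqrt(np.mean((forward(x)[1] - y) ** 2)))

lr = 0.05
history = []                  # (train RMSE, validation RMSE) per epoch
for epoch in range(2000):
    a, pred = forward(x_tr)
    err = pred - y_tr                      # gradient of squared error
    gW2 = a.T @ err / len(x_tr)
    gb2 = err.mean(axis=0)
    da = err @ W2.T * (1 - a ** 2)         # backprop through tanh
    gW1 = x_tr.T @ da / len(x_tr)
    gb1 = da.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    history.append((rmse(x_tr, y_tr), rmse(x_va, y_va)))

print(f"epoch    1: train={history[0][0]:.3f}  val={history[0][1]:.3f}")
print(f"epoch 2000: train={history[-1][0]:.3f}  val={history[-1][1]:.3f}")
```

In practice one would stop training (or keep the weights from) the epoch with the lowest validation RMSE, which is the usual early-stopping recipe.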
Can you describe the nature of your data? Is it financial time series data, perhaps?