First, let's try this neat lecture tag with the time Moobb gave (converted to seconds):
Quote:
Originally Posted by Moobb
Many thanks for your answer and sorry for not including the reference for the lecture, it is 1:10:40 (can't include the tag directly right now). I believe I understood it now: if the input points are not independent then chances are it won't generalise well for the full set of possible inputs (taking an example from number identification, if you increase the size of number 8 by a factor of two, you won't learn anything new by doing so). Using the analogy to the coordinate systems, if the features are not independent then you may have less information than you suppose you have, but it may still be more practical than devising a feature that automatically incorporates only new elements; in practice the algorithm will benefit only from the new information incorporated from the feature. Guess there is a practical limit in terms of model complexity at some point? Or that you may end up incorporating just noise, hence the occasional use of dimensionality reduction prior to establishing your features? Thanks again!

Yes, nonindependence of input data damages generalisation. But reversible transformations of any type don't reduce (or increase) the information content. In principle, there can never be a disadvantage in having extra features any more than there is a disadvantage in having more input points (you can just ignore some of them if you like), but some methods may not perform so well if you use more features.
Regarding financial time series, the assumptions of machine learning are not strictly true. It is recognised empirically that there is nonstationary behaviour (P(x) and P(y | x) change somewhat over time). [This may apply more to P(x) than P(y | x). For example, if you use 1990s data to create a model of the stock market, then use it in 2000, you might experience unexpected behaviour because you had never seen such market conditions before: the input data points could be a long way from any you had seen before. This is essentially the same reason that traders using any methods may lose when market conditions change a lot.]
As well as this, there is the nondeterministic component of the behavior. This effectively reduces the size of the input data set for the purpose of predicting the deterministic component (which is the main aim). This may be an issue, since the total amount of data available is rather limited. [If you have a stationary process and as much data as you want, I believe you can get as close to perfect knowledge of the deterministic component as you wish. But of course the nondeterministic component remains here as well].
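To make the last point concrete, here is a minimal sketch, assuming a toy stationary process of my own choosing (y = 2x plus Gaussian noise; the slope, noise level, and sample sizes are all made up for illustration). It fits a line to a small and a large sample: the estimate of the deterministic slope should tighten as the sample grows, while the residual spread stays near the noise level however much data is used.

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope = 2.0   # hypothetical deterministic component: y = 2x
noise_std = 1.0    # hypothetical nondeterministic component

def fit(n):
    """Fit a straight line to n noisy samples of the toy process."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_slope * x + rng.normal(0.0, noise_std, size=n)
    slope, intercept = np.polyfit(x, y, 1)
    residual_std = np.std(y - (slope * x + intercept))
    return slope, residual_std

slope_small, resid_small = fit(100)
slope_large, resid_large = fit(100_000)

print(f"n=100:     slope={slope_small:.3f}, residual std={resid_small:.3f}")
print(f"n=100000:  slope={slope_large:.3f}, residual std={resid_large:.3f}")
```

With financial data the catch is that you cannot simply raise n like this, and the process is not stationary in the first place.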
Quote:
Originally Posted by Rahul Sinha
Awesome explanation.
To add an example: Consider a Gaussian distribution with a non-diagonal covariance matrix in 2D space. It is obvious that the features (read: axes) are correlated or nonindependent. Performing a change of coordinate system, let's now have the eigenvector directions as the new coordinate system. No information is lost in the transformation (the space did not shrink or expand!) but now we have independent orthonormal coordinates. As pointed out, what is preserved is the "independence between the data points not the features".

For more see:
http://en.wikipedia.org/wiki/Multiva...l_distribution
https://en.wikipedia.org/wiki/Princi...onent_analysis
https://www.cs.princeton.edu/courses...notes/0419.pdf
http://www.cs.unm.edu/~williams/cs530/kl3.pdf
Bottom line is that it is always theoretically possible to remove the linear correlations between features using the transformation given by principal component analysis, but the decorrelated features are only guaranteed to be independent in the case where they are jointly normally distributed (this means that any linear combination of them is normal).
Moreover, this is always a reversible process preserving the independence of your input data points.
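Here is a rough sketch of Rahul's 2D example in NumPy, assuming an arbitrary covariance matrix I picked for illustration. It rotates correlated Gaussian samples onto the eigenvectors of their sample covariance, checks that the off-diagonal covariance vanishes, and then rotates back to show the transformation is reversible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-diagonal covariance: the two features are correlated.
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# Eigenvectors of the sample covariance give the new coordinate axes.
sample_cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sample_cov)

# Change of coordinates: each data point is rotated, none is dropped.
z = x @ eigvecs
z_cov = np.cov(z, rowvar=False)

# The rotation is orthogonal, so its transpose undoes it exactly.
x_back = z @ eigvecs.T

print("covariance before:\n", sample_cov)
print("covariance after:\n", z_cov)
print("recovered original points:", np.allclose(x_back, x))
```

The rows of x (the data points) are transformed one by one, so whatever independence they had between them is untouched; only the relationship between the columns (the features) changes.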
As a footnote, it is worth mentioning that a key reason market prediction is not entirely hopeless is that markets do not exhibit perfect Gaussian behaviour. It seems that the nonstationary behaviour is more important than the Hurst exponent being permanently greater than (or less than) 0.5 (it would be 0.5 for simple Brownian motion). E.g. see
http://www.optimaltrader.net/old/hur...ictability.pdf
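For anyone who wants to check the 0.5 figure, here is a minimal sketch of one common way to estimate the Hurst exponent (not necessarily the method used in the paper above): fit the slope of log(standard deviation of lagged differences) against log(lag). The function name, the lag range, and the simulated series are my own assumptions. For a simulated random walk the estimate should come out close to 0.5.

```python
import numpy as np

def hurst_exponent(series, max_lag=100):
    """Estimate H from the scaling std(x[t+lag] - x[t]) ~ lag**H."""
    lags = np.arange(2, max_lag)
    spreads = [np.std(series[lag:] - series[:-lag]) for lag in lags]
    # Slope of the log-log fit is the Hurst exponent estimate.
    slope, _ = np.polyfit(np.log(lags), np.log(spreads), 1)
    return slope

rng = np.random.default_rng(0)
# Discrete stand-in for simple Brownian motion: cumulative sum of white noise.
random_walk = np.cumsum(rng.standard_normal(20000))

h = hurst_exponent(random_walk)
print(f"Estimated Hurst exponent: {h:.2f}")
```

On real price series you would normally apply this to log prices, and the estimate can drift depending on the window used, which is another face of the nonstationarity mentioned above.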