Old 04-11-2013, 05:40 AM
Elroch
Re: Lecture 3 Q&A — independence of parameter inputs

First, let's try this neat lecture tag with the time Moobb gave (converted to seconds):

Originally Posted by Moobb:
Many thanks for your answer, and sorry for not including the reference for the lecture: it is 1:10:40 (can't include the tag directly right now). I believe I understand it now: if the input points are not independent, then chances are the model won't generalise well to the full set of possible inputs (taking an example from digit recognition: if you scale up an image of the number 8 by a factor of two, you won't learn anything new by doing so). Using the analogy to coordinate systems, if the features are not independent, then you may have less information than you suppose, but it may still be more practical than devising a feature that automatically incorporates only new elements; in practice the algorithm will benefit only from the new information the feature brings in. I guess there is a practical limit in terms of model complexity at some point? Or you may end up incorporating just noise, hence the occasional use of dimensionality reduction prior to establishing your features? Thanks again!
Yes, non-independence of input data damages generalisation. But reversible transformations of any type neither reduce nor increase the information content. In principle, there can never be a disadvantage in having extra features, any more than there is a disadvantage in having more input points (you can simply ignore some of them if you like), but some methods may not perform as well if you use more features.
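As a toy illustration of the "reversible transformations lose nothing" point (the matrix and data here are invented for the example): any invertible linear map of the features can be undone exactly, so no information is gained or lost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-feature data whose columns are strongly correlated.
x = rng.standard_normal((100, 2))
x[:, 1] = 0.9 * x[:, 0] + 0.1 * x[:, 1]   # make the features non-independent

# An invertible linear map of the features (det = 1, so A is invertible).
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
z = x @ A.T                                # transformed features

# The original features are recovered exactly: nothing was lost.
x_back = z @ np.linalg.inv(A).T
```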

Regarding financial time series, the assumptions of machine learning are not strictly true. It is recognised empirically that there is non-stationary behaviour (P(x) and P(y | x) change somewhat over time). [This may apply more to P(x) than to P(y | x). For example, if you used 1990s data to create a model of the stock market, then used it in 2000, you might experience unexpected behaviour because the market entered conditions never seen in training: the input data points could be a long way from any you had seen before. This is essentially the same reason that traders using any method may lose when market conditions change a lot.]
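A crude sketch of the "unseen market conditions" point, using made-up synthetic data (the distributions and the threshold are purely illustrative): new inputs that lie far from everything in the training set are exactly the points where generalisation guarantees say nothing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "training era" inputs and "later era" inputs drawn from a
# shifted distribution: a crude stand-in for changed market conditions.
x_train = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
x_new   = rng.normal(loc=4.0, scale=1.0, size=(50, 3))

def nearest_train_dist(points, train):
    """Distance from each point to its nearest training point."""
    d = np.linalg.norm(points[:, None, :] - train[None, :, :], axis=2)
    return d.min(axis=1)

# A large nearest-neighbour distance means the model is being asked
# about regions of input space it has never seen.
d_new = nearest_train_dist(x_new, x_train)
far_fraction = np.mean(d_new > 2.0)   # ad hoc threshold, for illustration only
```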

As well as this, there is the non-deterministic component of the behaviour. This effectively reduces the size of the input data set for the purpose of predicting the deterministic component (which is the main aim). This may be an issue, since the total amount of data available is rather limited. [If you have a stationary process and as much data as you want, I believe you can get as close to perfect knowledge of the deterministic component as you wish. But of course the non-deterministic component remains.]
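A minimal sketch of that last bracketed claim, with an invented sine target and noise as large as the signal (all numbers here are illustrative): given enough stationary data, a simple local average recovers the deterministic component to high accuracy, while the noise itself of course never goes away.

```python
import numpy as np

rng = np.random.default_rng(2)

# A fixed deterministic target plus heavy noise: y = f(x) + eps.
f = np.sin
x = rng.uniform(0, 2 * np.pi, size=200_000)
y = f(x) + rng.normal(scale=1.0, size=x.size)   # noise as large as the signal

# Local averaging over 50 bins: each bin holds ~4000 points, so the
# noise in each bin mean shrinks like 1/sqrt(n).
bins = np.linspace(0, 2 * np.pi, 51)
centres = 0.5 * (bins[:-1] + bins[1:])
idx = np.digitize(x, bins) - 1
estimate = np.array([y[idx == b].mean() for b in range(50)])

# Worst-case error of the recovered deterministic component.
max_err = np.max(np.abs(estimate - f(centres)))
```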
Originally Posted by Rahul Sinha:
Awesome explanation.

To add an example: consider a Gaussian distribution with a non-diagonal covariance matrix in 2D space. It is obvious that the features (read: axes) are correlated, i.e. non-independent. Perform a change of coordinate system, taking the eigenvector directions as the new coordinate axes. No information is lost in the transformation (the space did not shrink or expand!), but now we have independent orthonormal coordinates. As pointed out, what is preserved is the "independence between the data points, not the features".

For more see:
The bottom line is that it is always theoretically possible to remove the linear correlations between features using the transformation given by principal component analysis, but the features can only be made independent in the case where they are jointly normally distributed (meaning that any linear combination of them is normal).

Moreover, this is always a reversible process, preserving the independence of your input data points.
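A quick numpy sketch of both points (the covariance matrix is invented for the example): the PCA rotation diagonalises the sample covariance, so the new features are uncorrelated, and the rotation is exactly reversible, so the data points themselves are untouched.

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated 2-D Gaussian features (non-diagonal covariance).
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
x = rng.multivariate_normal(mean=[0, 0], cov=cov, size=5000)

# PCA: rotate onto the eigenvectors of the sample covariance matrix.
xc = x - x.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(xc.T))
z = xc @ eigvecs

# The new features are uncorrelated (off-diagonal covariance ~ 0),
# and the rotation is invertible: the original data come back exactly.
z_cov = np.cov(z.T)
x_back = z @ eigvecs.T + x.mean(axis=0)
```

(For non-Gaussian data, zero correlation in `z_cov` would not imply independence, which is the caveat above.)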

As a footnote, it is worth mentioning that a key reason market prediction is not entirely hopeless is that markets do not exhibit perfectly Gaussian behaviour. It seems that the non-stationary behaviour is more important than the Hurst exponent being permanently greater than (or less than) 0.5 (it would be exactly 0.5 for simple Brownian motion).
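For what it's worth, here is a rough sketch (one common estimator among several, with illustrative parameters) of checking that simple Brownian motion has Hurst exponent 0.5, via the scaling of the standard deviation of lag-k differences:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simple Brownian motion: cumulative sum of i.i.d. Gaussian increments.
steps = rng.standard_normal(100_000)
path = np.cumsum(steps)

# std(path[t+k] - path[t]) ~ k**H, so the slope of log(std) against
# log(k) estimates the Hurst exponent H.
lags = np.array([1, 2, 4, 8, 16, 32, 64])
stds = np.array([np.std(path[lag:] - path[:-lag]) for lag in lags])
hurst = np.polyfit(np.log(lags), np.log(stds), 1)[0]   # slope ≈ H
```

A persistent trend would push the estimate above 0.5 and mean reversion below it; for this simulated Brownian path it comes out very close to 0.5.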