LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Homework 2 (http://book.caltech.edu/bookforum/forumdisplay.php?f=131)
-   -   Lecture 3 Q&A independence of parameter inputs (http://book.caltech.edu/bookforum/showthread.php?t=4186)

 Moobb 04-10-2013 01:55 PM

Lecture 3 Q&A independence of parameter inputs

There is a discussion about the importance of having independent input data and how this propagates to features. Is it true that features necessarily inherit independence from the data? If they don't, how bad is that? For example, in Finance there are quite a few studies using support vector machines on a grid defined by different moving averages, which overlap (1w, 1m, etc.). In this case the features are clearly not independent. Would this be seen as a questionable procedure?

 Elroch 04-10-2013 02:34 PM

Re: Lecture 3 Q&A independence of parameter inputs

Quote:
 Originally Posted by Moobb (Post 10327) There is a discussion about the importance of having independent input data and how this propagates to features. Is it true that features necessarily inherit independence from the data? If they don't, how bad is that? For example, in Finance there are quite a few studies using support vector machines on a grid defined by different moving averages, which overlap (1w, 1m, etc.). In this case the features are clearly not independent. Would this be seen as a questionable procedure?
Could you be more precise about which place in the book or lectures you are referring to regarding independence?

With regard to the choice of features for representing financial data, it is not difficult to remove the more obvious dependencies, but it is not clear that this is crucial. As an analog, suppose you have a basis for the plane (1,0) and (1,1). There is clearly a correlation between these two axes in your sense, but a simple linear transformation to a basis of (1,0) and (0,1) gets rid of it. If you are going to use kernels, you will be permitting many transformations of this type or others. The same is true of moving averages, where you can replace them with carefully chosen differences between them if you wish, but it may not be crucial.
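A tiny numeric illustration of the basis example above (a sketch in Python/NumPy, just to make the arithmetic concrete; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent "true" coordinates c1, c2 of 10,000 points
c = rng.normal(size=(10000, 2))

# Describe each point by its coordinates w.r.t. the skewed basis (1,0), (1,1):
# point = c1*(1,0) + c2*(1,1), so the observed features are (c1 + c2, c2)
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])          # rows are the basis vectors
features = c @ B                    # correlated features

# A simple linear transformation (the inverse basis change) removes the correlation
decorrelated = features @ np.linalg.inv(B)

print(np.corrcoef(features.T)[0, 1])      # ~0.71: clearly correlated
print(np.corrcoef(decorrelated.T)[0, 1])  # ~0.0: correlation gone
```

The transformation is reversible, so nothing about the data points themselves is gained or lost; only the description changes.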

 yaser 04-10-2013 03:08 PM

Re: Lecture 3 Q&A independence of parameter inputs

Quote:
 Originally Posted by Moobb (Post 10327) There is a discussion about the importance of having independent input data and how this propagates to features. Is it true that features necessarily inherit independence from the data? If they don't, how bad is that? For example, in Finance there are quite a few studies using support vector machines on a grid defined by different moving averages, which overlap (1w, 1m, etc.). In this case the features are clearly not independent. Would this be seen as a questionable procedure?
Could you use the [lecture3] macro (see the "Including a lecture video segment" thread at the top) to pinpoint the part you are referring to? Thank you.

 Elroch 04-10-2013 05:31 PM

Re: Lecture 3 Q&A independence of parameter inputs

Moobb, having rewatched the Q&A, my understanding is this. The independence that is important is that the input points are independently selected. Intuitively, they are a representative sample, rather than one which gives disproportionate importance to some region of the input space.

With regard to the features, these are a generalisation of co-ordinates which are used to describe the input data points (eg the value of a moving average is a feature which can be thought of as a co-ordinate, even though it is defined in terms of many co-ordinates). The independence that is preserved after a transformation is the independence between the data points, not the features: the set of points remains a representative sample of the (transformed) space of possible inputs.

 Moobb 04-11-2013 12:32 AM

Re: Lecture 3 Q&A independence of parameter inputs

Many thanks for your answer, and sorry for not including the reference for the lecture; it is 1:10:40 (can't include the tag directly right now). I believe I understand it now: if the input points are not independent, then chances are the model won't generalise well to the full set of possible inputs (taking an example from digit recognition: if you take an image of the number 8 and simply scale it up by a factor of two, you won't learn anything new by doing so). Using the analogy to coordinate systems, if the features are not independent then you may have less information than you suppose, but it may still be more practical than devising a feature that incorporates only new elements; in practice the algorithm will benefit only from the new information the feature brings in. I guess there is a practical limit in terms of model complexity at some point? Or you may end up incorporating just noise, hence the occasional use of dimensionality reduction prior to establishing your features? Thanks again!

 Rahul Sinha 04-11-2013 12:49 AM

Re: Lecture 3 Q&A independence of parameter inputs

Quote:
 Originally Posted by Elroch (Post 10330) Moobb, having rewatched the Q&A, my understanding is this. The independence that is important is that the input points are independently selected. Intuitively, they are a representative sample, rather than one which gives disproportionate importance to some region of the input space. With regard to the features, these are a generalisation of co-ordinates which are used to describe the input data points (e.g. the value of a moving average is a feature which can be thought of as a co-ordinate, even though it is defined in terms of many co-ordinates). The independence that is preserved after a transformation is the independence between the data points, not the features: the set of points remains a representative sample of the (transformed) space of possible inputs.
Awesome explanation.:bow:

To add an example: consider a Gaussian distribution with a non-diagonal covariance matrix in 2D space. It is obvious that the features (read: the axes) are correlated, i.e. not independent. Performing a change of coordinate system, let's take the eigenvector directions as the new coordinate axes. No information is lost in the transformation (the space did not shrink or expand!), but now we have independent, orthonormal coordinates. As pointed out, what is preserved is the "independence between the data points, not the features".
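To make this concrete, a quick sketch (Python/NumPy assumed; the covariance values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# 2D Gaussian with a non-diagonal covariance: the two features are correlated
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=20000)

# Rotate onto the eigenvector directions of the (sample) covariance
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
Y = X @ eigvecs

# The rotation is reversible and preserves distances (no information lost),
# but the new coordinates are uncorrelated -- and, being Gaussian, independent
print(np.cov(Y.T))   # off-diagonal entries ~0
```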

 Elroch 04-11-2013 05:40 AM

Re: Lecture 3 Q&A independence of parameter inputs

First, let's try this neat lecture tag with the time Moobb gave (converted to seconds):

Quote:
 Originally Posted by Moobb (Post 10334) Many thanks for your answer, and sorry for not including the reference for the lecture; it is 1:10:40 (can't include the tag directly right now). I believe I understand it now: if the input points are not independent, then chances are the model won't generalise well to the full set of possible inputs (taking an example from digit recognition: if you take an image of the number 8 and simply scale it up by a factor of two, you won't learn anything new by doing so). Using the analogy to coordinate systems, if the features are not independent then you may have less information than you suppose, but it may still be more practical than devising a feature that incorporates only new elements; in practice the algorithm will benefit only from the new information the feature brings in. I guess there is a practical limit in terms of model complexity at some point? Or you may end up incorporating just noise, hence the occasional use of dimensionality reduction prior to establishing your features? Thanks again!
Yes, non-independence of input data damages generalisation. But reversible transformations of any type don't reduce (or increase) the information content. In principle, there can never be a disadvantage in having extra features, any more than there is a disadvantage in having more input points (you can just ignore some of them if you like), but some methods may not perform as well if you use more features.

Regarding financial time series, the assumptions of machine learning are not strictly true. It is recognised empirically that there is non-stationary behaviour (P(x) and P(y | x) change somewhat over time). [This may apply more to P(x) than P(y | x). For example, if you use 1990s data to create a model of the stock market, then used it in 2000, you might experience unexpected behaviour because you had never seen such market conditions before: the input data points could be a long way from any you had seen before. This is essentially the same reason that traders using any methods may lose when market conditions change a lot]

As well as this, there is the non-deterministic component of the behavior. This effectively reduces the size of the input data set for the purpose of predicting the deterministic component (which is the main aim). This may be an issue, since the total amount of data available is rather limited. [If you have a stationary process and as much data as you want, I believe you can get as close to perfect knowledge of the deterministic component as you wish. But of course the non-deterministic component remains here as well].
Quote:
 Originally Posted by Rahul Sinha (Post 10335) Awesome explanation.:bow: To add an example: consider a Gaussian distribution with a non-diagonal covariance matrix in 2D space. It is obvious that the features (read: the axes) are correlated, i.e. not independent. Performing a change of coordinate system, let's take the eigenvector directions as the new coordinate axes. No information is lost in the transformation (the space did not shrink or expand!), but now we have independent, orthonormal coordinates. As pointed out, what is preserved is the "independence between the data points, not the features".
:cool: ;)
For more see:
http://en.wikipedia.org/wiki/Multiva...l_distribution
https://en.wikipedia.org/wiki/Princi...onent_analysis
https://www.cs.princeton.edu/courses...notes/0419.pdf
http://www.cs.unm.edu/~williams/cs530/kl3.pdf
Bottom line is that it is always theoretically possible to remove the linear correlations between features using the transformation given by principal component analysis, but decorrelated features are guaranteed to be independent only when they are jointly normally distributed (which means that any linear combination of them is normal).

Moreover, this is always a reversible process preserving the independence of your input data points.
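A quick counterexample for the non-Gaussian case (Python/NumPy, purely illustrative): two features can have zero linear correlation while being completely dependent.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100000)
y = x**2 - 1        # a deterministic function of x, yet linearly uncorrelated with it

# PCA-style decorrelation would leave this pair essentially untouched...
print(np.corrcoef(x, y)[0, 1])   # ~0: no linear correlation

# ...but the features are clearly not independent: conditioning on x moves y
print(y[np.abs(x) > 1].mean())   # ~1.5, far from the unconditional mean of ~0
```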

As a footnote, it is worth mentioning that a key reason market prediction is not entirely hopeless is that markets do not exhibit perfect Gaussian behaviour. It seems that the non-stationary behaviour is more important than the Hurst exponent being permanently greater than (or less than) 0.5 (it would be exactly 0.5 for simple Brownian motion). E.g. see http://www.optimaltrader.net/old/hur...ictability.pdf

 Moobb 04-11-2013 07:34 PM

Re: Lecture 3 Q&A independence of parameter inputs

Elroch, thank you so much for your help. Regarding your point about non-stationarity and the difficulty it introduces for financial forecasting, do you see that as necessarily invalidating any attempt at machine learning forecasting in Finance? Could it be that the time series itself is non-stationary, but some specific patterns within it (which people try to capture with technical indicators, for example) are stable? Those technical indicators would then be your features, and maybe when we condition on them the time series becomes more stationary? I think another main use in Finance is classification, which can then be used for portfolio allocation, for example.
Many thanks again!!:)

 Elroch 04-12-2013 06:31 AM

Re: Lecture 3 Q&A independence of parameter inputs

Quote:
 Originally Posted by Moobb (Post 10345) Elroch, thank you so much for your help.
Thank you. But bear in mind there is much I do not yet know!
Quote:
 Regarding your point about non-stationarity and the difficulty it introduces for financial forecasting, do you see that as necessarily invalidating any attempt at machine learning forecasting in Finance?
No. The evidence seems pretty clear that it is not entirely hopeless, but a tough struggle.
Quote:
 Could it be that the time series itself is non-stationary, but some specific patterns within it (which people try to capture with technical indicators, for example) are stable? Those technical indicators would then be your features, and maybe when we condition on them the time series becomes more stationary? I think another main use in Finance is classification, which can then be used for portfolio allocation, for example. Many thanks again!!:)
I basically agree with that, except that I can't see a reason why any structure is likely to be entirely stable. You can always calculate a technical indicator (typically a real valued, non-linear transformation of past prices), but the idea that probabilistic statements based on one or many such indicators will remain permanently true seems implausible: at the very least the probabilities are going to change over time.

Regarding non-stationarity, there is an awkward conflict between the wish to have plenty of training data and the wish to have training data that is recent enough not to be misleading. One paper I read found an interesting way of dealing with this by weighting more recent data more strongly when training. http://stockresearch.googlecode.com/...prediction.pdf [If this doesn't display in your browser, try saving it and loading in a PDF viewer] Unfortunately, I am not yet aware of how to do this without modifying or rewriting general purpose machine learning tools.
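For what it's worth, the recency-weighting idea can be sketched without any special-purpose tooling by reweighting a plain least-squares fit (Python/NumPy; the drifting series and the half-life are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
t = np.arange(T)

# Hypothetical drifting relationship: the "true" slope changes half-way through
slope = np.where(t < 250, 1.0, 3.0)
x = rng.normal(size=T)
y = slope * x + 0.1 * rng.normal(size=T)

# Exponential recency weights: recent examples count more
half_life = 50.0
w = 0.5 ** ((T - 1 - t) / half_life)

# Weighted least squares = ordinary least squares on sqrt(w)-scaled rows
X = x.reshape(-1, 1)
coef_w = np.linalg.lstsq(X * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]
coef_u = np.linalg.lstsq(X, y, rcond=None)[0]
print(coef_u[0], coef_w[0])  # unweighted averages the regimes (~2), weighted tracks the recent one (~3)
```

The same sqrt-weight trick applies to any learner whose objective is a sum over examples, which is one way to get recency weighting out of tools that only accept per-example weights (or none at all, by resampling in proportion to the weights).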

Fortunately, discovering eternal truths about markets (or other time series) is not necessary, since there are two things you can do. One is to re-run the training process at intervals; a more radical option is to replace your approach with a more sophisticated one if it stops being effective enough. (For example, you might start with just moving averages as features, and it might work. If it stopped working, you might add a Hurst exponent calculated at an appropriate scale, as a complex feature that might make learning more feasible if you describe the problem in the right way. I don't know if this is true, but it may be. :) ) The possibilities are infinite, and I sometimes think that is more of a problem than a help!

And yes, classification is surely as useful as prediction of real-valued quantities. I like to think of it in information terms: a binary classification is a prediction of 1 bit of information, and there is a huge range of possible bits you might choose to model. A real-valued prediction can be approximated by a classification problem where you have a sequence of bins corresponding to intervals. I am not sure of the relative merits when there is a choice. One issue is how the information being modelled relates to the way it will be used later. Of course, any output can be considered an indicator itself, but then there is the question of how that output will be used in trading and how it will affect trading results. In principle, error measures should be tailored to suit the effect on results, but this may not be easy.
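The binning idea in the last paragraph, as a minimal sketch (Python/NumPy; the return distribution and bin edges are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
returns = rng.normal(0.0, 0.02, size=1000)   # hypothetical daily returns

# Bin edges turn the real-valued target into ordered class labels
edges = np.array([-0.02, 0.0, 0.02])         # 4 classes: big down, down, up, big up
labels = np.digitize(returns, edges)         # integers 0..3

# A predicted class can be mapped back to a coarse real value,
# e.g. the training-set mean of its bin
bin_means = np.array([returns[labels == k].mean() for k in range(4)])
```

A classifier trained on `labels` then predicts roughly which interval the return will fall in, and `bin_means` gives one way to turn that back into a number.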
