05-22-2013, 11:09 AM
Elroch
Invited Guest
Join Date: Mar 2013
Posts: 143
Re: All things considered...

I haven't actually watched lecture 17 properly yet (beyond dipping in before I started this course), but I like the way you are thinking. [And by now you may know that I can't resist discussing an interesting issue, mainly because I think discussion helps clarify concepts.] I should flag the rest of this post as my own thoughts, with no claim to being definitive.

The importance of avoiding data snooping is sinking in (and explains some past bad experiences of mine). However, snooping is only fatal when you fail to keep the data you use to validate your models untainted. Otherwise the worst likely outcome is that your hypothesis fails to be validated.

One example is when you build on work other people have done. This only taints data that might have directly or indirectly affected their work, or that has some special correlation with the data they used. Certainly any data that did not exist when they did the work is completely untainted (as long as it has no special relationship to their data beyond what holds for other data you might wish to draw conclusions about). For example, suppose someone produces a model of economic activity in country X. Later data from X should be untainted, but data from a country Y over the same period as their data might be tainted (through correlations).

For example, suppose you are working on a topic area and there are two sets of data, S and V. You hide V away without looking at it, then do all sorts of horrible things with S: looking at it, coming up with hypotheses, testing them, throwing them away, reusing the data, coming up with new hypotheses, and so on. Eventually you arrive at a hypothesis you think may be true, but you realize you have reason to suspect you may be fooling yourself. If you then test that hypothesis on V, you can draw useful conclusions about it, untainted by your sins with S.
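A minimal sketch of that hold-out discipline, on toy data of my own invention (the quadratic target, the 150/50 split and the polynomial hypothesis classes are all illustrative assumptions, not anything specific from the lectures): snoop on S as much as you like, but touch V exactly once, at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy quadratic. In practice S and V come from your real dataset.
x = rng.uniform(-1, 1, 200)
y = x**2 + 0.1 * rng.normal(size=200)

# Hide V away before doing anything else.
x_S, y_S = x[:150], y[:150]
x_V, y_V = x[150:], y[150:]

def fit_mse(deg, x_tr, y_tr, x_te, y_te):
    """Fit a degree-`deg` polynomial on (x_tr, y_tr); return MSE on (x_te, y_te)."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    return np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)

# 'Horrible things' with S: repeatedly reuse it, here by choosing the degree
# that minimises error on S itself (an optimistically biased criterion).
best_deg = min(range(1, 10), key=lambda d: fit_mse(d, x_S, y_S, x_S, y_S))

# One honest evaluation on V, untouched until now.
final_mse = fit_mse(best_deg, x_S, y_S, x_V, y_V)
```

The in-sample error on S keeps shrinking as the hypothesis class grows, which is exactly why it proves nothing; the single figure computed on V is the only estimate untainted by the search.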

Although that describes what a lot of people (including me in the past) do in real applications before ever hearing the term "machine learning", it seems far more likely that a more principled approach to generating a hypothesis will produce one that is validated on pristine out-of-sample data.

For example, you could split S into two parts, A and B, without looking at them. Then use A to come up with some sort of understanding of what you are modelling (looking is not forbidden here) before you start experimenting with hypotheses. It's up to you whether you reckon you can do a better job at this than a silicon-based Kohonen map.

Then come up with a procedure for generating a hypothesis that might do the job [e.g. a class of SVM hyperparameters or a class of neural network architectures].

Only once you have reached a final conclusion on the general algorithm for generating a hypothesis, use B for cross-validation to arrive at the specific hyperparameters or neural network architecture, and train your SVM or neural network on the whole of B. Finally (the icing on the cake), validate your final hypothesis on V (most people would probably be confident enough in well-designed cross-validation to skip this step).
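The whole A/B/V procedure above can be sketched end to end. Everything concrete here is an assumption for illustration: the sinusoidal toy data, the 100/100/100 split, and polynomial degree standing in for the SVM hyperparameters or network architecture; only the discipline (explore A, cross-validate and train on B, one final check on V) is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in data; A, B and V would be disjoint splits of your real dataset.
x = rng.uniform(-1, 1, 300)
y = np.sin(2 * x) + 0.1 * rng.normal(size=300)
x_A, y_A = x[:100], y[:100]        # A: look, plot, build understanding
x_B, y_B = x[100:200], y[100:200]  # B: cross-validation, then final training
x_V, y_V = x[200:], y[200:]        # V: pristine final validation

def cv_mse(deg, x_d, y_d, k=5):
    """k-fold cross-validation MSE for a degree-`deg` polynomial fit."""
    folds = np.array_split(np.arange(len(x_d)), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x_d)), test_idx)
        c = np.polyfit(x_d[train_idx], y_d[train_idx], deg)
        errs.append(np.mean((np.polyval(c, x_d[test_idx]) - y_d[test_idx]) ** 2))
    return np.mean(errs)

# Hyperparameter (here: degree) chosen by cross-validation on B only.
best_deg = min(range(1, 8), key=lambda d: cv_mse(d, x_B, y_B))

# Train the chosen model on the whole of B, then the one final check on V.
coeffs = np.polyfit(x_B, y_B, best_deg)
mse_V = np.mean((np.polyval(coeffs, x_V) - y_V) ** 2)
```

Note that V is read exactly once, so `mse_V` remains an unbiased estimate even though B was reused across the cross-validation folds.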

If the amount of data is limited, there must be considerable value in sharpening this procedure to use the data as efficiently as possible without tainting it. If there is a lot, there might be potential in iterating the above procedure to home in on a very well-tailored methodology.