LFD Book Forum

LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   The Final (http://book.caltech.edu/bookforum/forumdisplay.php?f=138)
-   -   All things considered... (http://book.caltech.edu/bookforum/showthread.php?t=4306)

doneit 05-22-2013 08:54 AM

All things considered...
After watching lecture 17, especially the final part, it seems the absolute safest approach would be to do unsupervised learning on an anonymous bunch of numbers, without knowledge of the domain and without referencing whatever conclusions other people may have arrived at. Then and only then start interpreting whatever patterns arose in light of domain knowledge. Of course, customers likely want fully packaged solutions, and opportunities for optimising the process would have been missed.

All things considered, I'm guessing it would be a brave person (by which I really mean "foolish") who started offering commercial services without a significant period of learning the ropes in real life situations :p

Elroch 05-22-2013 11:09 AM

Re: All things considered...
I haven't actually properly watched lecture 17 yet (except dipping in before I started this course) but I like the way you are thinking. [And by now you may know that I can't resist discussing an interesting issue. Mainly because I think discussion helps clarify concepts]. I have to flag the rest of this post as being my thoughts, with no claim to being definitive.

The importance of avoiding data snooping is sinking in (explaining some past bad experiences). However, it is only fatal when you fail to keep the data that you use to validate your models untainted. Otherwise the worst that is likely is that you find your hypothesis does not get validated.

One example is when you look at work other people have done. This only taints data that might have directly or indirectly affected their work, or have some special correlation with the data they used. Certainly any data that didn't exist when they did the work is completely untainted (as long as it doesn't have some special relationship to their data that is not the case with other data you might wish to draw conclusions about). For example, someone produces a model about economic activity in country X. Later data from X should be untainted, but data from a country Y in the same period as their data might be tainted (by correlations).

For example, suppose you are working on a topic area and there are two sets of data, S and V. Suppose you hide V away without looking at it, then do all sorts of horrible things with S: looking at it, coming up with hypotheses, testing them, throwing them away, reusing the data, coming up with new hypotheses and so on, until you eventually arrive at a hypothesis that you think may be true, while realising you have reasons to suspect you may be fooling yourself. If you then test your hypothesis on V, you can draw useful conclusions about the hypothesis untainted by your sins with S.
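A minimal sketch of that S / V discipline, in Python. Everything here is an illustrative assumption (toy 1-D data, a threshold "hypothesis class", the 10% noise rate), not something from the thread; the point is only that V is never consulted while snooping on S.

```python
import random

random.seed(0)

def noisy_label(x):
    y = int(x > 0.6)                               # hidden true rule
    return 1 - y if random.random() < 0.1 else y   # 10% label noise

data = [(x, noisy_label(x)) for x in (random.random() for _ in range(200))]
random.shuffle(data)
S, V = data[:150], data[150:]                      # hide V away first

def accuracy(threshold, points):
    return sum((x > threshold) == bool(y) for x, y in points) / len(points)

# Snoop freely on S: try many hypotheses and keep the best-looking one.
candidates = [i / 100 for i in range(100)]
best = max(candidates, key=lambda t: accuracy(t, S))

# V was never touched during the search, so this single evaluation gives
# an estimate of out-of-sample performance untainted by the snooping on S.
in_sample = accuracy(best, S)
out_of_sample = accuracy(best, V)
```

The in-sample score will look a touch optimistic (the threshold was tuned on S, noise and all), while the single look at V stays an honest estimate; repeated looks at V would start tainting it too.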

Although that describes what a lot of people (including me in the past) do in real applications before having even heard the term "machine learning", it seems far more likely that a more principled approach to the generation of a hypothesis will result in one that will be validated in pristine out of sample data.

For example, you could split S into two parts, A and B, without looking at them. Then use A to come up with some sort of understanding of what you are modelling (looking is not forbidden) before you start experimenting with hypotheses. It's up to you whether you reckon you can do a better job than a silicon-based Kohonen map on this.

Then come up with an idea of a procedure to generate a hypothesis that might do the job [eg a class of SVM hyperparameters or a class of neural network architectures].

Only once you have reached a final conclusion on which general algorithm for generating a hypothesis you want to use, use B for cross-validation to arrive at the specific hyperparameters or neural network architecture, and train your SVM or neural network on the whole of set B. Finally (icing on the cake), validate your final hypothesis on V (most people would probably be confident enough in well-designed cross-validation to skip this step).
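The A / B / V workflow above can be sketched in a few lines of Python. This is a toy stand-in, not a definitive recipe: a 1-D nearest-neighbour model plays the role of the "general algorithm", the candidate values of k play the role of the hyperparameter class, and all data and names are made up for illustration.

```python
import random

random.seed(1)

def noisy_label(x):
    y = int(x > 0.5)
    return 1 - y if random.random() < 0.1 else y   # 10% label noise

data = [(x, noisy_label(x)) for x in (random.random() for _ in range(300))]
random.shuffle(data)
V, S = data[:60], data[60:]        # V is hidden away before anything else
A, B = S[:120], S[120:]            # A: free exploration; B: model selection

def knn_predict(train, x, k):
    # majority vote among the k nearest neighbours in 1-D
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(y for _, y in neighbours) * 2 > k)

def cv_error(train, k, folds=5):
    # plain k-fold cross-validation error for a given hyperparameter k
    n = len(train) // folds
    errs = []
    for i in range(folds):
        val = train[i * n:(i + 1) * n]
        fit = train[:i * n] + train[(i + 1) * n:]
        errs.append(sum(knn_predict(fit, x, k) != y for x, y in val) / len(val))
    return sum(errs) / folds

# (In the workflow above, A would first be inspected by hand to build an
# understanding of the problem; that human step is omitted here.)
best_k = min([1, 3, 5, 9, 15], key=lambda k: cv_error(B, k))

# Final, once-only check of the chosen model (trained on all of B) against
# the pristine set V.
v_err = sum(knn_predict(B, x, best_k) != y for x, y in V) / len(V)
```

The key design point is the ordering: V is set aside before any exploration, and B is only consulted after the class of hypotheses has been fixed, so neither selection step contaminates the final estimate on V.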

If the amount of data is limited, there must be considerable value in sharpening this procedure to use the data as efficiently as possible without tainting it. If there is a lot, there might be potential in iterating the above procedure to home in on a very well-tailored methodology.

doneit 05-22-2013 09:19 PM

Re: All things considered...
Elroch, thanks for the reply and clarification. I think my post may seem a little pessimistic in retrospect. Maybe I should have said that, for the unwary, the devil seems very much in the detail (or in this case maybe that's "in the data"?). Lecture 17 highlights some of the subtle dangers, and there are some almost surreal aspects to the theory throughout the course (I suppose starting with the fact that you can get any sort of handle on out-of-sample performance at all). My comments were really alluding to theory vs practice and the relative importance thereof, especially the latter where, as you say, a disciplined approach to handling the data is needed, and that's where experience counts.

For example, I recently came across Grok (1), and while I don't have the capability to assess its claims, I'm betting the human element isn't fully excluded from the process, for some of the reasons raised in lecture 17. In other words, the topic may be *machine* learning, but that doesn't mean the whole process can be mechanised (can data know it's biased?).

(1) https://www.groksolutions.com/index.html


Powered by vBulletin® Version 3.8.3
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.