#1
After watching lecture 17, especially the final part, it seems the absolute safest approach would be to do unsupervised learning on an anonymous bunch of numbers, without knowledge of the domain and without referencing whatever conclusions other people may have arrived at. Then, and only then, start interpreting whatever patterns arose in light of domain knowledge. Of course, customers likely want fully packaged solutions, and opportunities for optimising the process would be missed.
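As a toy illustration of that "anonymous numbers first" idea, here is a minimal sketch: cluster a column of raw values with no idea what they measure, and only afterwards ask what the groups might mean. The data, the choice of k=2, and the iteration count are all made-up assumptions for the sketch, not anything from the lecture.

```python
import random

# Toy sketch: cluster anonymous numbers first, interpret afterwards.
random.seed(1)

# Pretend these arrived as an unlabelled column of numbers. (Secretly they
# come from two Gaussians, so there is a real pattern to find.)
values = ([random.gauss(0.0, 1.0) for _ in range(100)] +
          [random.gauss(5.0, 1.0) for _ in range(100)])

def kmeans_1d(xs, k, iters=20):
    """Plain Lloyd's algorithm on 1-D data; returns sorted cluster centres."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Keep the old centre if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d(values, k=2)
# Only at this point would domain knowledge come in, to ask what the two
# groups of values actually correspond to.
```

The point is purely the order of operations: the clustering step never sees any labels or domain information.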
All things considered, I'm guessing it would be a brave person (by which I really mean "foolish") who started offering commercial services without a significant period of learning the ropes in real-life situations.
#2
I haven't actually properly watched lecture 17 yet (except dipping in before I started this course) but I like the way you are thinking. [And by now you may know that I can't resist discussing an interesting issue. Mainly because I think discussion helps clarify concepts]. I have to flag the rest of this post as being my thoughts, with no claim to being definitive.
How important it is to avoid data snooping is sinking in (which explains some past bad experiences). However, snooping is only fatal when you fail to keep the data that you use to validate your models untainted. Otherwise the worst that is likely is that you find your hypothesis does not get validated.

One example is when you look at work other people have done. This only taints data that might have directly or indirectly affected their work, or that has some special correlation with the data they used. Certainly any data that didn't exist when they did the work is completely untainted (as long as it doesn't have some special relationship to their data that other data you might wish to draw conclusions about lacks). For example, if someone produces a model of economic activity in a country using data up to a certain date, data arriving after that date is still untainted for your own purposes.

For another example, suppose you are working on a topic area and there are two sets of data, D1 and D2. The tempting approach is to explore all the data freely, come up with a hypothesis, and see how well it fits. Although that describes what a lot of people (including me in the past) do in real applications before having even heard the term "machine learning", it seems far more likely that a more principled approach to the generation of a hypothesis will result in one that will be validated in pristine out-of-sample data.

For example, you could split D1 into a training set and a validation set, and set D2 aside untouched. Then come up with an idea of a procedure to generate a hypothesis that might do the job [eg a class of SVM hyperparameters or a class of neural network architectures]. Only once you have come to a final conclusion on what general algorithm for generating a hypothesis you want to use, use D2 to test the result.

If the amount of data is limited, there must be considerable value in sharpening this procedure to use the data as efficiently as possible without tainting it. If there is a lot, there might be potential in iterating the above procedure to home in on a very well-tailored methodology.
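The split-then-test discipline described above can be sketched in code. Everything here is a made-up toy: D1 is the data used for all exploration and model selection, D2 is the pristine set consulted exactly once at the end, and a family of candidate thresholds stands in for real hyperparameters. The point is only the order of operations.

```python
import random

random.seed(0)

def make_data(n):
    """Toy labelled 1-D data: y = 1 when x > 0.5, with 10% label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

D1 = make_data(400)   # used for all exploration and model selection
D2 = make_data(200)   # pristine test set: consulted exactly once, at the end

# Split D1 into training and validation parts.
random.shuffle(D1)
train, val = D1[:300], D1[300:]

def accuracy(threshold, data):
    """Fraction of points whose label matches the rule 'x > threshold'."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

# The "procedure for generating a hypothesis": a family of candidate
# thresholds, standing in for SVM hyperparameters or network architectures.
candidates = [i / 10 for i in range(1, 10)]

# Select using the training part, sanity-check on the validation part.
best = max(candidates, key=lambda t: accuracy(t, train))
val_acc = accuracy(best, val)

# Only now touch D2, once, to estimate out-of-sample performance.
final_estimate = accuracy(best, D2)
```

Because D2 played no role in choosing `best`, `final_estimate` is an honest estimate of out-of-sample performance; running the selection loop again after looking at D2 would be exactly the snooping the lecture warns about.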
#3
Elroch, thanks for the reply and clarification. I think my post may seem a little pessimistic in retrospect. Maybe I should have said that for the unwary the devil seems very much in the detail (or in this case maybe that's "in the data"?). Lecture 17 highlights some of the subtle dangers, and there are some almost surreal aspects to the theory throughout the course (I suppose starting with the fact that you can get any sort of handle on out-of-sample performance at all). My comments were really alluding to theory vs. practice and the relative importance thereof, especially the latter, where, as you say, a disciplined approach to handling the data is needed, and that's where experience counts.
For example, I recently came across Grok (1) and, while I don't have the capability to assess its claims, I'm betting the human element isn't fully excluded from the process, for some of the reasons raised in lecture 17. IOW, the topic may be *machine* learning, but that doesn't mean the whole process can be mechanised (can data know it's biased?).

(1) https://www.groksolutions.com/index.html