LFD Book Forum The concept "h is fixed before you generate the data set" is extremely vague

#1
03-20-2019, 10:11 PM
 Fromdusktilldawn Junior Member Join Date: Sep 2017 Posts: 5
The concept "h is fixed before you generate the data set" is extremely vague

Can someone please explain to me the concept of "h is fixed before you generate the data set" as appears on page 22 of the text?

As it stands, this is an extremely vague statement. What does it mean by "fixed", what does it mean by "generate"?

Here is a typical modern machine learning pipeline for most students.

Find some data somewhere, typically Kaggle (you don't generate it whatsoever, someone else does it for you through unknown means)

Observe the data and get a sense of its dimensionality and the number of data points. If the data set is too large, it cannot even be loaded into a computer. Therefore the parameters associated with this data MUST be known in order to do machine learning.

Based on the data, categorize it into a typical problem. For example, classification, prediction, etc.

Pick a hypothesis h known to do well for the problem. Say SVM. Tune the hypothesis h so that it can at least accept the data. For example, the dimensionality of the weights in the hypothesis is obtained from the dimensionality of the data. Otherwise, a dimension mismatch error will be thrown by MATLAB and no machine learning can be done.

Train your hypothesis h, parameterized by the weights w, until h achieves the lowest in-sample error. Call that the final hypothesis g.

Use final hypothesis g on test set.

In this pipeline, data is not generated, it is given. h is not fixed, it is adjusted based on the data (type of data, dimensionality of data). If we do not know the data at all, we cannot possibly construct a hypothesis. It would be akin to using a low-pass filter for 1D signals when your data is actually a continuous stream of 3D video frames. The data must be given prior to constructing h, and h must be adjusted based on the problem at hand. This is not a "before", it is clearly an "after".

Why does it seem that this typical learning pipeline does not fit into the learning model described in the book? What does "h is fixed before you generate the data set" mean in a practical sense?
#2
03-23-2019, 07:24 AM
 htlin NTU Join Date: Aug 2009 Location: Taipei, Taiwan Posts: 610
Re: The concept "h is fixed before you generate the data set" is extremely vague

Good question. Yes, the statement on page 22 does not fit into the actual learning scenario yet, as explained in your words and similarly on page 23. If you read on, you'll gradually see how we move closer to the actual scenario. What page 22 tries to say is that the fixed h (i.e. a readily-colored bin) is the assumption that the bin model needs. The closest real-world scenario is perhaps when someone hands you a hypothesis before anyone looks at the data (generated by someone else, say, on Kaggle). If you assume that the data generator gathers/generates the data i.i.d. from some distribution, you can *test* the hypothesis using the results on page 22.
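For reference, the kind of result meant here is the Hoeffding inequality for a single fixed h. The statement below is written from memory in the book's notation (E_in for in-sample error, E_out for out-of-sample error, N for the sample size), so check it against page 22 rather than treating it as a quotation; δ is a tolerance introduced here only for illustration:

```latex
% Valid only when h is chosen before the N points are drawn i.i.d.
\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \le 2e^{-2\epsilon^{2}N}
\quad \text{for any } \epsilon > 0.

% Setting the right-hand side equal to \delta and solving for \epsilon:
% with probability at least 1 - \delta,
|E_{\text{in}}(h) - E_{\text{out}}(h)| \le \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}.
```

For example, with N = 1000 and δ = 0.05 the square root is about 0.043, so the measured error of a hypothesis handed to you in advance is within roughly 4.3 percentage points of its true error with probability at least 95%.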

Hope this helps.
__________________
When one teaches, two learn.
#3
02-18-2021, 09:39 AM
 Roelof Junior Member Join Date: Feb 2021 Posts: 4
Re: The concept "h is fixed before you generate the data set" is extremely vague

Actually this is more intuitive than it may seem. Going back to the bin model: what you need to remember is that the choice of a hypothesis h determines the color of every marble in the bin (a marble x is green if h(x) = f(x), and red otherwise).

So suppose you've chosen a particular h. After that, you draw a sample out of the bin (this is what is meant by "generate").

By the Hoeffding inequality, the fraction of green marbles in a randomly drawn sample is, with high probability, close to the fraction of green marbles in the bin.

Now, however, suppose we change the h after we have selected (generated) the sample. What this means is that we are recoloring the marbles in the bin according to the new h.

The sample that you had previously selected knows absolutely nothing about (has absolutely no relation to) this new recoloring. The marbles could be all green or all red or anything else in between depending on the new h. So for example if they were all green and you selected a new sample then that new sample would be all green.

In short, you can't do a selection out of the bin, change the contents of the bin arbitrarily and expect that selection to be able to say something about the new contents of the bin.
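The argument above can be checked numerically. The following is only a minimal sketch under assumptions that are not in the book or the posts above: every hypothesis is a pure guesser whose true error is 0.5, each hypothesis's coloring of the sample is independent of the others, and the names (`in_sample_error`, `run_experiment`) and the values of `n_points`, `n_hypotheses`, and `eps` are made up for illustration. It compares a hypothesis fixed before sampling with one picked after looking at the sample.

```python
import math
import random


def in_sample_error(n_points, true_error, rng):
    """Fraction of red marbles in n_points i.i.d. draws from a bin
    whose red fraction is true_error."""
    reds = sum(rng.random() < true_error for _ in range(n_points))
    return reds / n_points


def run_experiment(n_trials=2000, n_points=20, n_hypotheses=50,
                   true_error=0.5, eps=0.25, seed=0):
    """Return (fixed_rate, selected_rate, hoeffding_bound).

    fixed_rate:    how often |E_in - E_out| > eps for a hypothesis
                   chosen before the sample is drawn.
    selected_rate: the same frequency when the hypothesis with the
                   lowest in-sample error is chosen after the draw.
    """
    rng = random.Random(seed)
    fixed_bad = 0
    selected_bad = 0
    for _ in range(n_trials):
        # Assumption: each hypothesis colors the sample independently.
        errors = [in_sample_error(n_points, true_error, rng)
                  for _ in range(n_hypotheses)]
        # h fixed in advance: always hypothesis 0, whatever the sample shows.
        if abs(errors[0] - true_error) > eps:
            fixed_bad += 1
        # h chosen after the fact: the one that looks best on this sample.
        if abs(min(errors) - true_error) > eps:
            selected_bad += 1
    # Single-hypothesis Hoeffding bound: 2 * exp(-2 * eps^2 * N).
    bound = 2 * math.exp(-2 * eps ** 2 * n_points)
    return fixed_bad / n_trials, selected_bad / n_trials, bound


if __name__ == "__main__":
    fixed, selected, bound = run_experiment()
    print("fixed h:    ", fixed)
    print("selected h: ", selected)
    print("Hoeffding:  ", bound)
```

Under these assumptions the fixed hypothesis should stay within the single-hypothesis bound, while the hypothesis selected after sampling deviates far more often, because its in-sample error was chosen precisely for being unusually low. The sample says little about a bin that was effectively recolored after the draw.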

All times are GMT -7.