LFD Book Forum

LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Homework 6 (http://book.caltech.edu/bookforum/forumdisplay.php?f=135)
-   -   Hw 6 q1 (http://book.caltech.edu/bookforum/showthread.php?t=477)

lucifirm 05-11-2012 03:42 AM

Hw 6 q1
 
The problem question is:
"In general, if we use H′ instead of H, how does deterministic noise behave?"

My question is:
Is this for a fixed N? A small N, or one large enough to get rid of overfitting?

mikesakiandcp 05-11-2012 11:04 AM

Re: Hw 6 q1
 
Quote:

Originally Posted by lucifirm (Post 2041)
The problem question is:
"In general, if we use H′ instead of H, how does deterministic noise behave?"

My question is:
Is this for a fixed N? A small N, or one large enough to get rid of overfitting?

The size of the training set (N) is more related to overfitting stochastic noise. Deterministic noise is about the inability of the hypothesis set to fit the target function.
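To make that precise, the book defines the deterministic noise at a point x as the part of f that the best hypothesis in {\cal H} cannot capture:

f(x) - h^*(x), where h^* is the best approximation to f within {\cal H} (e.g., the minimizer of \mathbb{E}_x[(h(x) - f(x))^2]).

It plays the same role in overfitting that stochastic noise does, except that it comes from the limitations of the model rather than from randomness in the data.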

elkka 05-11-2012 12:59 PM

Re: Hw 6 q1
 
I am also confused about this term "in general". Does it mean in absolutely any situation? Or in most situations? Or in all reasonable situations, excluding cases where we try to fit 10th-degree polynomials to 10 points, as in this lecture's example?

mikesakiandcp, I think N has to do with deterministic noise, at least as described in the lecture. Yes, it is the inability of the hypothesis set to fit the target function, measured as the expected difference between the "best" hypothesis and the target. But the way we defined the expected hypothesis, as an expectation over an infinite number of data sets of a specific size N, depends on N very much. Slide 14 of Lecture 11 illustrates the connection.

AqibEjaz 05-11-2012 02:11 PM

Re: Hw 6 q1
 
@elkka: Well, the "deterministic noise" is actually independent of N; see Lecture 8, slide 20. You can see that the "bias" remains the same no matter how large N becomes. With increasing N it is the variance that becomes smaller, and hence the overall E_out becomes smaller.

As I understand it, if you have an infinite number of training sets, then it does not matter whether you have 10 points in each set or 10,000: the average hypothesis will remain the same. In the case of 10 points, the hypotheses we get from the individual training sets will be spread all over the place, but they will be "centered" around the same hypothesis (i.e. the average hypothesis). In the case of 10,000 points, the individual hypotheses will be less spread out, but again they will be centered around the same hypothesis as in the 10-point case. "Bias" depends only on the mismatch between the target function and the hypothesis set used to model it.
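For reference, the bias-variance decomposition being discussed (from Lecture 8) can be written as

\mathbb{E}_{\cal D}[E_{out}(g^{({\cal D})})] = \mathbb{E}_x[(\bar{g}(x) - f(x))^2] + \mathbb{E}_x \mathbb{E}_{\cal D}[(g^{({\cal D})}(x) - \bar{g}(x))^2],

where \bar{g}(x) = \mathbb{E}_{\cal D}[g^{({\cal D})}(x)] is the average hypothesis over data sets of size N. The first term is the bias and the second is the variance; as described above, it is essentially only the second that shrinks as N grows.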

mikesakiandcp 05-11-2012 02:39 PM

Re: Hw 6 q1
 
Quote:

Originally Posted by elkka (Post 2051)
I am also confused about this term "in general". Does it mean in absolutely any situation? Or in most situations? Or in all reasonable situations, excluding cases where we try to fit 10th-degree polynomials to 10 points, as in this lecture's example?

mikesakiandcp, I think N has to do with deterministic noise, at least as described in the lecture. Yes, it is the inability of the hypothesis set to fit the target function, measured as the expected difference between the "best" hypothesis and the target. But the way we defined the expected hypothesis, as an expectation over an infinite number of data sets of a specific size N, depends on N very much. Slide 14 of Lecture 11 illustrates the connection.

You are right, N is related to the deterministic noise. What I meant to say is that we have no control over N, since it is simply the number of examples in our training set. Given a fixed training set (and thus a fixed N), we are interested in how well the hypothesis set can approximate the target function.

gjtucker 05-11-2012 03:00 PM

Re: Hw 6 q1
 
It seems like it depends on the definition of deterministic noise. If we define it as \mathbb{E}_x[(\bar{g}(x) - f(x))^2] (as was done in the lecture slides) and we assume that \bar{g} is the best hypothesis in {\cal H}, then it is independent of N.

Where the finite N comes in is through the variance term. With small N, the more complicated model will have a harder time finding the best hypothesis and will have high variance, which is what we see in the plots in the lecture. But as N increases, my guess is that \mathbb{E}_x[(\bar{g}(x) - f(x))^2] stays approximately the same, while the variance term goes down. I suppose this wouldn't be too hard to check numerically.
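A minimal sketch of that check (the \sin(\pi x) target, second-degree least-squares fits, and all the constants below are illustrative assumptions, not the homework's setup):

Code:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Target function (an illustrative stand-in).
    return np.sin(np.pi * x)

def bias_variance(N, degree=2, n_datasets=2000, n_test=1000):
    # Fit many size-N data sets, then estimate bias and variance on a test grid.
    x_test = np.linspace(-1, 1, n_test)
    preds = np.empty((n_datasets, n_test))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        w = np.polyfit(x, f(x), degree)       # least-squares fit on one data set
        preds[d] = np.polyval(w, x_test)
    g_bar = preds.mean(axis=0)                # average hypothesis g_bar(x)
    bias = np.mean((g_bar - f(x_test)) ** 2)  # E_x[(g_bar(x) - f(x))^2]
    var = np.mean((preds - g_bar) ** 2)       # E_x E_D[(g_D(x) - g_bar(x))^2]
    return bias, var

# Bias should stay roughly constant while variance shrinks as N grows.
for N in (5, 10, 50, 200):
    b, v = bias_variance(N)
    print(f"N={N:4d}  bias={b:.4f}  var={v:.4f}")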

dudefromdayton 05-11-2012 04:55 PM

Re: Hw 6 q1
 
Heads up from the textbook: exercise 4.3 on page 125!

elkka 05-11-2012 06:35 PM

Re: Hw 6 q1
 
Thanks, but I don't have the book.

vasilism 05-12-2012 03:37 AM

Re: Hw 6 q1
 
Some questions:
What does it mean that a function (H') is a subset of another function (H)?
Is H' picked from the same data model we use for H?

yaser 05-12-2012 03:57 AM

Re: Hw 6 q1
 
Quote:

Originally Posted by vasilism (Post 2063)
Some questions:
What does it mean that a function (H') is a subset of another function (H)?
Is H' picked from the same data model we use for H?

{\cal H} and {\cal H}' are not functions, but rather sets of functions (the hypotheses h\in{\cal H} are functions).

lucifirm 05-12-2012 03:59 AM

Re: Hw 6 q1
 
Thank you, guys. I reviewed the lecture video and that cleared up my ideas. :)

tx75074 08-20-2012 09:53 AM

Re: Hw 6 q1
 
"H prime a subset of H" Does mean any subset of H?

yaser 08-20-2012 01:58 PM

Re: Hw 6 q1
 
Quote:

Originally Posted by tx75074 (Post 4183)
"H prime a subset of H" Does mean any subset of H?

{\cal H}' \subset {\cal H} means {\cal H}' is a proper subset of {\cal H} (not equal to {\cal H}). Other than that, it could be any subset.
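For instance (an illustrative choice, not the homework's specific construction), with polynomial models one could take

{\cal H} = \{ h(x) = \sum_{q=0}^{10} w_q x^q \}, \qquad {\cal H}' = \{ h \in {\cal H} : w_q = 0 \text{ for } q \ge 3 \},

so that every hypothesis in {\cal H}' also belongs to {\cal H}, but not conversely.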

jcmorales1564 05-10-2013 05:49 AM

Re: Hw 6 q1
 
Lecture 11 (overfitting) has been my favorite to date. I can’t wait for Lectures 12 (regularization) and 13 (validation) to see how the issue of overfitting is tackled. I thought I was understanding the material; however, I read Q1 in HW6 and could not answer it outright. I realized that I am still somewhat confused and would appreciate some clarification.

I think that one of my issues is how the flow of the lectures slides between situations where the target function is known and situations where it is not (real-world cases). I am not stating this as a criticism; it is just that I still don’t know how to clearly “read the signals” that we are moving from one regime (f known) to the other (f not known). For example, to calculate variance and bias (deterministic noise), we need to know the target function. However, in real-world cases we don’t know the target function, so it would be impossible to calculate the variance and bias. In Q1, it says that “f is fixed”. This is a case where f is known. I am unclear about what it means for f to be fixed. Would not being fixed mean a “moving target”? Are variance and bias useful concepts in real-world cases, or are they only of an academic nature, perhaps as a stepping stone to better understand the underlying concepts of machine learning?

I hope that these questions come out sounding right and that I will receive some responses. This issue of overfitting has been the most enlightening thing that I have learned in this course and I just wish to understand it really well.

Thank you.

Juan

yaser 05-10-2013 09:47 AM

Re: Hw 6 q1
 
Quote:

Originally Posted by jcmorales1564 (Post 10792)
In Q1, it says that “f is fixed”. This is a case where f is known. I am unclear about what it means for f to be fixed.

I understand how the cases where f was explicitly given can cause confusion, as this seems to go against the main premise of an unknown target function. The best way to resolve this confusion is to assume that someone else knows what the target is, and they will use that information to evaluate different aspects of our learning process, but we ourselves do not know what f is as we try to learn it from the data.

Having said that, the notion of 'fixed' is different. Q1 describes two learning processes (with two different hypothesis sets) and asserts that both processes are trying to learn the same target. That target can be unknown to both of them, but it is the same target and that is what makes it fixed. The point of having f fixed here is that deterministic noise depends on more than one component in a learning situation, and by fixing the target function we take out one of these dependencies.

jcmorales1564 05-11-2013 02:27 AM

Re: Hw 6 q1
 
Got it! Thank you, professor.

Michael Reach 05-14-2013 02:07 PM

Re: Hw 6 q1
 
Just want to check if I have the idea right here:
Deterministic noise means the bias: the difference between the correct target hypothesis and the possible hypotheses in this hypothesis set. If H' is smaller than H, it will in general be less able to get close to the target hypothesis, and the deterministic noise will be bigger. At least it can't be less.

However, though the noise is bigger, there is another effect that will often work in the opposite direction: the larger hypothesis set may give us the dubious ability to fit the deterministic noise better. Since we have more hypotheses to choose from, we may fit more of the noise with the larger hypothesis set and end up worse off.

Does that sound right?

Elroch 05-14-2013 04:18 PM

Re: Hw 6 q1
 
Quote:

Originally Posted by Michael Reach (Post 10831)
Just want to check if I have the idea right here:
Deterministic noise means the bias: the difference between the correct target hypothesis and the possible hypotheses in this hypothesis set. If H' is smaller than H, it will in general be less able to get close to the target hypothesis, and the deterministic noise will be bigger. At least it can't be less.

However, though the noise is bigger, there is another effect that will often work in the opposite direction: the larger hypothesis set may give us the dubious ability to fit the deterministic noise better. Since we have more hypotheses to choose from, we may fit more of the noise with the larger hypothesis set and end up worse off.

Does that sound right?

Well, you might want to check the precise definition of bias. And one of your conclusions.

Elroch 05-14-2013 04:33 PM

Re: Hw 6 q1
 
Quote:

Originally Posted by yaser (Post 10795)
I understand how the cases where f was explicitly given can cause confusion, as this seems to go against the main premise of an unknown target function. The best way to resolve this confusion is to assume that someone else knows what the target is, and they will use that information to evaluate different aspects of our learning process, but we ourselves do not know what f is as we try to learn it from the data.

Having said that, the notion of 'fixed' is different. Q1 describes two learning processes (with two different hypothesis sets) and asserts that both processes are trying to learn the same target. That target can be unknown to both of them, but it is the same target and that is what makes it fixed. The point of having f fixed here is that deterministic noise depends on more than one component in a learning situation, and by fixing the target function we take out one of these dependencies.

It has indeed been a source of some discomfort that the phenomenon being studied depends on something that is fixed but unknown! As far as I can see, it is possible to be given the same data, use the same method, and be overfitting with one target function but underfitting with another.

When I first saw this issue, it made me think that some knowledge about the distribution of the possible target functions was necessary before one could assess the quality of a particular machine learning algorithm for function approximation in a real application. However, I now believe that the technique of cross-validation gives an objective way of studying out-of-sample performance for function approximation that should allow probabilistic conclusions roughly analogous to Hoeffding's. [I am familiar with this technique from the optimization of hyperparameters when using SVMs.]

One of the great things about doing this course is getting to grips with issues like this. In fact, I was using the C hyperparameter without really knowing what it was before we got to regularization in the lectures! I hope I've got the right end of the stick now. :)
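As a footnote, here is a minimal sketch of that cross-validation idea. Ridge regression on synthetic data stands in for tuning the SVM's C parameter, so all the specifics below are illustrative assumptions rather than the course's setup:

Code:

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a noisy linear target (an illustrative assumption).
N, d = 50, 8
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

def ridge_fit(X, y, lam):
    # Regularized least squares: w = (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    # k-fold cross-validation estimate of out-of-sample squared error.
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

# Pick the regularization strength with the lowest estimated CV error.
for lam in (0.01, 0.1, 1.0, 10.0):
    print(f"lambda={lam:5.2f}  CV error={cv_error(X, y, lam):.4f}")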

