LFD Book Forum

Chapter 1 - The Learning Problem > Exercises and Problems

yaser 03-25-2012 12:24 AM

Exercises and Problems
 
Please comment on the chapter problems in terms of difficulty, clarity, and time demands. This information will help us and other instructors in choosing problems to assign in our classes.

Also, please comment on the exercises in terms of how useful they are in understanding the material.

pablo 04-21-2012 10:20 PM

Exercise 1.10
 
Since this is a theory week, thought this might be a good time to explore Hoeffding a bit.

I understand c_1 and c_rand satisfy Hoeffding experimentally as described, but conceptually does c_rand satisfy Hoeffding? For example, suppose it is unknown whether each coin is fair (or that they are known to have varying fairness - e.g. c_1 is 50/50, c_2 is 40/60, etc.). Would each coin represent a separate 'bin' or would the random selection of a coin plus the ten flips represent the randomized selection condition for c_rand?

Trying to understand if it's necessary for the coins to be identical.

yaser 04-21-2012 11:32 PM

Re: Exercise 1.10
 
Quote:

Originally Posted by pablo (Post 1523)
Since this is a theory week, thought this might be a good time to explore Hoeffding a bit.

I understand c_1 and c_rand satisfy Hoeffding experimentally as described, but conceptually does c_rand satisfy Hoeffding? For example, suppose it is unknown whether each coin is fair (or that they are known to have varying fairness - e.g. c_1 is 50/50, c_2 is 40/60, etc.). Would each coin represent a separate 'bin' or would the random selection of a coin plus the ten flips represent the randomized selection condition for c_rand?

Trying to understand if it's necessary for the coins to be identical.

Interesting question indeed. Hoeffding does apply to each randomly selected coin individually as you point out. If the coins have different values of E_{\rm out}, then the added randomization due to the selection of a coin affects the relationship between E_{\rm in} and E_{\rm out}. This is exactly the premise of the complete bin model analysis.
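For reference, the single-bin Hoeffding inequality under discussion, with sample frequency \nu, bin mean \mu, and sample size N, is

P[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^2 N},

and the union-bound version for M bins (hypotheses) replaces the right-hand side with 2Me^{-2\epsilon^2 N}, which is the form relevant to the 1000-coin experiment.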

tadworthington 07-02-2012 03:13 PM

Re: Exercises and Problems
 
Problem 1.4

WHAT I LIKED:
This is an excellent problem, and a model one, in my opinion, for helping a student understand the material. It starts out easy, with a small data set that, importantly, is user-generated. Having to generate a data set - even one as simple as a linearly separable data set in 2 dimensions - goes a long way toward understanding how the Perceptron works and why it wouldn't work if the data were not linearly separable. We gradually progress to bigger data sets, plotting the data along the way so that we can see what is happening. It is not only instructive but also quite exhilarating to see the results of the PLA (plotting both the target function and the hypothesis generated by the PLA) actually graphed out in front of you on a computer screen!

I also thought that the progression to 10 dimensions only after working through the more easily visualized 2-dimensional examples is perfect. Finally, the histogram approach to understanding the computational complexity of the PLA was, I thought, genius.

TIME SPENT:
Overall I spent about 90 minutes on the problem, although a lot of that was spent documenting my code for future reference and general "prettying" of my plots and results; so I would guess that this problem can be completed comfortably in an hour, assuming knowledge of programming in a language where plotting is easy (I used Python, but Matlab/Octave would also be excellent choices for quick-and-dirty graphics programming of this type). For someone with no programming experience, the problem would probably take much more time.

CLARITY:
I think the problem is exceptionally clear.

DIFFICULTY:
I thought the problem was relatively easy, but the Perceptron is a relatively easy concept so I think making it harder is neither necessary nor appropriate. For metrics purposes, on a 1-10 scale of difficulty I would give this a 3.
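For readers attempting Problem 1.4, here is a minimal sketch of the kind of experiment described above. The data generation, dimensions, and function names are illustrative assumptions, not the book's prescribed setup:

[code]
import numpy as np

def generate_data(N, d=2, rng=None):
    # N points uniform in [-1,1]^d, labeled by a random linear target function
    rng = rng or np.random.default_rng()
    X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, d))])  # prepend x0 = 1
    w_f = rng.uniform(-1, 1, d + 1)                               # target weights
    return X, np.sign(X @ w_f), w_f

def pla(X, y, max_iters=100_000):
    # Perceptron Learning Algorithm: fix one misclassified point per iteration
    w = np.zeros(X.shape[1])
    for t in range(max_iters):
        mis = np.nonzero(np.sign(X @ w) != y)[0]
        if len(mis) == 0:
            return w, t                # converged: no misclassified points remain
        w = w + y[mis[0]] * X[mis[0]]  # the PLA update rule
    return w, max_iters

X, y, w_f = generate_data(20)
g, iters = pla(X, y)
print(f"PLA converged after {iters} updates")
[/code]

Repeating this over many randomly generated data sets and histogramming iters gives the computational-complexity picture mentioned above.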

htlin 07-04-2012 03:16 PM

Re: Exercises and Problems
 
Quote:

Originally Posted by tadworthington (Post 3335)
Problem 1.4

Thank you so much for the valuable feedback!

vsthakur 07-27-2012 08:38 PM

Re: Exercise 1.10
 
While comparing the coin-flip experiment with the bin model:
a) The sample size of the individual bin is equal to the number of coin flips (N=10).
b) The number of hypotheses is equal to the number of coins (M=1000).

But the aspect of repeating the entire coin-flip experiment a large number of times (100,000) does not have any counterpart in the (single or multiple) bin model.

Is the purpose of repeating the coin-flip experiment only to find estimates of P[|\nu-\mu|>\epsilon] without using the Hoeffding inequality, and then to compare them with the bound given by Hoeffding? Kindly clarify this point.

Thanks.

yaser 07-27-2012 09:30 PM

Re: Exercise 1.10
 
Quote:

Originally Posted by vsthakur (Post 3717)
While comparing the coin-flip experiment with the bin model:
a) The sample size of the individual bin is equal to the number of coin flips (N=10).
b) The number of hypotheses is equal to the number of coins (M=1000).

But the aspect of repeating the entire coin-flip experiment a large number of times (100,000) does not have any counterpart in the (single or multiple) bin model.

Is the purpose of repeating the coin-flip experiment only to find estimates of P[|\nu-\mu|>\epsilon] without using the Hoeffding inequality, and then to compare them with the bound given by Hoeffding? Kindly clarify this point.

Thanks.

The purpose is to average out statistical fluctuations. In any given run, we may get unusual \nu's by coincidence, but by taking the average we can be confident that we captured the typical \nu values.
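For anyone who wants to reproduce the exercise, a minimal sketch of the simulation (fair coins, with the counts from the exercise; the variable names are mine):

[code]
import numpy as np

rng = np.random.default_rng()
runs, coins, flips = 100_000, 1000, 10
nu_1, nu_rand, nu_min = np.empty(runs), np.empty(runs), np.empty(runs)

for r in range(runs):
    heads = rng.binomial(flips, 0.5, size=coins)      # 10 fair flips per coin
    nu_1[r] = heads[0] / flips                        # the first coin
    nu_rand[r] = heads[rng.integers(coins)] / flips   # a coin chosen at random
    nu_min[r] = heads.min() / flips                   # the coin with the fewest heads

# averaging over runs smooths out the statistical fluctuations described above
print(nu_1.mean(), nu_rand.mean(), nu_min.mean())     # roughly 0.5, 0.5, and 0.04
[/code]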

vsthakur 08-10-2012 07:56 PM

Problem 1.10 : Expected Off Training Error
 
Hi,
If I got it right, in a noiseless setting, for a fixed D, if all f are equally likely, then the expected off-training-set error of any hypothesis h is 0.5 (part (d) of Problem 1.10, page 37), and hence any two algorithms are the same in terms of expected off-training-set error (part (e) of the same problem).

My question is, does this not contradict generalization by Hoeffding? Specifically, the following point is bothering me.

By Hoeffding: Ein approaches Eout for a larger number of hypotheses (i.e., for small epsilon) as N grows sufficiently large. This would imply that expected(Eout) should be approximately the same as expected(Ein), and not a constant (0.5).

Can you please provide some insight on this; perhaps my comparison is erroneous.

Thanks.

yaser 08-10-2012 09:25 PM

Re: Problem 1.10 : Expected Off Training Error
 
Quote:

Originally Posted by vsthakur (Post 3952)
Hi,
If I got it right, in a noiseless setting, for a fixed D, if all f are equally likely, then the expected off-training-set error of any hypothesis h is 0.5 (part (d) of Problem 1.10, page 37), and hence any two algorithms are the same in terms of expected off-training-set error (part (e) of the same problem).

My question is, does this not contradict generalization by Hoeffding? Specifically, the following point is bothering me.

By Hoeffding: Ein approaches Eout for a larger number of hypotheses (i.e., for small epsilon) as N grows sufficiently large. This would imply that expected(Eout) should be approximately the same as expected(Ein), and not a constant (0.5).

Can you please provide some insight on this; perhaps my comparison is erroneous.

This is an important question, and I thank you for asking it. There is a subtle point that creates this impression of contradiction.

At face value, the statement "all f are equally likely" sounds reasonable. Mathematically, it corresponds to trying to learn a randomly generated target function, and getting the average performance of this learning process. It should not be a surprise that we would get 0.5 under these circumstances, since almost all random functions are impossible to learn.

In terms of E_{\rm in} and E_{\rm out}, the Hoeffding inequality certainly holds for each of these random target functions, but E_{\rm in} itself will be close to 0.5 for almost all of these functions since they have no pattern to fit, so Hoeffding would indeed predict E_{\rm out} to be close to 0.5 on average.

This is why learning was decomposed into two separate questions in this chapter. In terms of these two questions, the one that "fails" in the random function approach is "E_{\rm in}\approx 0?"

Let me finally comment that treating "all f are equally likely" as a plausible statement is a common trap. This issue is addressed in detail in the context of Bayesian learning in the following video segment:

http://work.caltech.edu/library/182.html

vsthakur 08-10-2012 11:56 PM

Re: Problem 1.10 : Expected Off Training Error
 
My takeaway point: "all f are equally likely" corresponds to trying to learn a randomly generated target function.

Thanks for the detailed explanation. The Bayesian learning example highlights the ramifications of this assumption; a very useful point indeed.

magdon 08-11-2012 07:15 PM

Re: Problem 1.10 : Expected Off Training Error
 
Hoeffding applies to a single hypothesis h. To take a concrete case of the setting in the problem, suppose that f is obtained by flipping a fair coin for every data point to obtain \pm 1. The problem proves that your expected off-training-set error is 0.5; no surprise, since f is random.

Hoeffding tells you that your training error (if N is large enough) will not deviate far from Eout with very high probability. This means that with very high probability, Ein will be close to 0.5.

Problem 1.10 says that if you are in such a pathological learning situation with a random f, then no matter what you do in-sample, your out-of-sample performance will be very bad. (This is sometimes called no-free-lunch, because you cannot expect to succeed unless you assume that f is not completely random.)

Hoeffding (and VC) says that provided your hypothesis set is not too large compared to N, you will know that you are in a bad situation, because your Ein will reflect Eout and be close to 0.5. The nature of your \cal H with respect to f will determine how good your Eout is; if f is very bad, then Eout will be close to 0.5. Hoeffding's role is to ensure that Ein will tell you that you are in this tough situation.

To summarize: Problem 1.10 identifies one case in which you are in trouble and Eout will be 0.5. Hoeffding tells you when you will be able to know that you are in trouble.
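A quick numerical illustration of this point (my own sketch, not from the book): when the labels are independent fair coin flips, any fixed hypothesis has both Ein and Eout near 0.5, so Ein faithfully reports the bad news.

[code]
import numpy as np

rng = np.random.default_rng()
N, N_test, trials = 100, 10_000, 1_000
ein, eout = np.empty(trials), np.empty(trials)

for t in range(trials):
    y_train = rng.choice([-1, 1], N)       # random f on the training points
    y_test = rng.choice([-1, 1], N_test)   # random f off the training points
    h = 1                                  # a fixed hypothesis: always predict +1
    ein[t] = np.mean(y_train != h)
    eout[t] = np.mean(y_test != h)

print(ein.mean(), eout.mean())             # both come out near 0.5
[/code]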


Quote:

Originally Posted by vsthakur (Post 3952)
Hi,
If I got it right, in a noiseless setting, for a fixed D, if all f are equally likely, then the expected off-training-set error of any hypothesis h is 0.5 (part (d) of Problem 1.10, page 37), and hence any two algorithms are the same in terms of expected off-training-set error (part (e) of the same problem).

My question is, does this not contradict generalization by Hoeffding? Specifically, the following point is bothering me.

By Hoeffding: Ein approaches Eout for a larger number of hypotheses (i.e., for small epsilon) as N grows sufficiently large. This would imply that expected(Eout) should be approximately the same as expected(Ein), and not a constant (0.5).

Can you please provide some insight on this; perhaps my comparison is erroneous.

Thanks.


vsthakur 08-13-2012 03:09 AM

Re: Problem 1.10 : Expected Off Training Error
 
Thank you

axelrv 09-02-2012 04:46 PM

Exercise 1.5
 
The exercise is to determine whether each problem is more suited for the learning or design approach.

Part a "Determining the age at which a particular medical test is recommended" seems like an information retrieval problem not a learning or design problem. Doctors have already decided when to recommend a test.

Perhaps it should read: "Determining the age at which a particular medical test should be recommended or performed" to make it a learning problem.

magdon 09-03-2012 06:42 AM

Re: Exercise 1.5
 
Yes, the use of "is" makes it seem like the answer is known and just needs to be retrieved; "should be" would have been a better choice of words, thanks.

Quote:

Originally Posted by axelrv (Post 4817)
The exercise is to determine whether each problem is more suited for the learning or design approach.

Part a "Determining the age at which a particular medical test is recommended" seems like an information retrieval problem not a learning or design problem. Doctors have already decided when to recommend a test.

Perhaps it should read: "Determining the age at which a particular medical test should be recommended or performed" to make it a learning problem.


cygnids 01-13-2013 10:14 AM

Re: Exercises and Problems
 
With respect to the discussion in the Prob 1.10 thread here, it is noted that learning is not possible when the target function is a random function. On that note, what crosses my mind is how one reconciles that statement with the thought that a purely random function could have more characterizable features after transformation to other domains. For example, white noise maps to a constant in the spectral domain. Perhaps naively: if I had a set of images with a reasonable proportion of fully noisy images, and chose to apply the ML apparatus in the frequency domain, I could learn a system to classify noisy vs. not-noisy? Right?

It's very likely I'm picking on a point out of context here?

cygnids 01-13-2013 10:40 AM

Re: Exercises and Problems
 
I suppose I'm interpreting "noise" rather loosely? When I said white noise maps to a constant in the frequency domain, I was thinking of Gaussian noise, in which case the output can be characterized as a PDF, and hence is not quite "noise" since it can be probabilistically characterized? :clueless:

magdon 01-13-2013 02:15 PM

Re: Exercises and Problems
 
It is true that white noise in the time domain is uniform in the frequency domain. But I do not quite follow how that relates to the prediction task, which in your example can be formulated as follows:

A target function f(x) is white noise. You are given f(x_n) on some data points x_n and asked to learn how to predict f on a test data point x_test. Your best prediction is 0, and your error will be distributed according to a Gaussian no matter what your training data were; that is the definition of white noise. This would be your prediction even before you saw any training data, so you have not improved your prediction (i.e., you have not been able to learn) from the training data.

What bounds like Hoeffding's tell you is that you will find out from your training data that your error will be large.

The same happens in classification where the function is a random \pm1. The data does not help you to predict outside the data, but at least you will learn from your data that you cannot predict outside the data. It is better than learning nothing :).

In learning, it is important to distinguish between how well you can predict outside the data and how well you think you can predict outside the data. We only have access to the latter, and the first task of the theory is to ensure that we have not fooled ourselves, so that the latter matches the former.

Quote:

Originally Posted by cygnids (Post 8612)
With respect to the discussion in the Prob 1.10 thread here, it is noted that learning is not possible when the target function is a random function. On that note, what crosses my mind is how one reconciles that statement with the thought that a purely random function could have more characterizable features after transformation to other domains. For example, white noise maps to a constant in the spectral domain. Perhaps naively: if I had a set of images with a reasonable proportion of fully noisy images, and chose to apply the ML apparatus in the frequency domain, I could learn a system to classify noisy vs. not-noisy? Right?

It's very likely I'm picking on a point out of context here?

cygnids 01-16-2013 08:37 AM

Re: Exercises and Problems
 
Thank you for your response and explanation. Mea culpa. I was thinking a bit too generally, perhaps loosely too, when I should have thought through an example such as the one you note.

In passing, let me restate what I had in mind in my earlier post. To begin with, I shouldn't have used the word frequency; rather, I should have said wavelength, since I used images as my example. I thought: if I had a perfectly noisy set of images as input, and my end goal was to classify noisy vs. not-noisy images, I would have a case where I attempt to learn the problem after a 2d Fourier transform of the images. For the sake of discussion, in the transformed space we'd see a flattish surface corresponding to the various spatial wavelengths (i.e., equal energy in all wavelengths, since it is a Fourier-transformed Gaussian), and basically a constant/flat function (glossing over a lot of practical points, such as tapering to keep it band-limited). I would use this smooth *characterization of noise* as an input to the learning algorithm, and eventually classify new images. I.e., if in the wavelength space the transformed image is flattish => it is a "noisy" image, else => it is a "not-noisy" image.

In my mind, I started out with a target function which was completely random, i.e., noisy, and I used the Fourier-transformed space to turn it into a near-constant (smooth) function which I could use for classification. And presumably, one managed to learn using a perfectly noisy image set. This thinking prompted my question of why random functions can't be learned. Needless to add, I did not map my thinking properly enough; the target function for this binary classification isn't the 2d Fourier transform, and further, I also mapped the input space (i.e., image pixel location) onto a wavelength space.

I like your simple example. Thank you.

magdon 01-16-2013 10:36 AM

Re: Exercises and Problems
 
Good example to clarify an important issue. In your example, the target function is not random. It assigns a -1 to a signal that is white noise and a +1 to a signal that is not white noise (say, one containing a real image). A random target would assign a random \pm1 to the signal.

What you have described is what we call feature construction, and you will study it later in the book. Specifically, given the input signal, construct the Fourier transform and take (for example) the variance of this Fourier signal. This variance is a good feature of the input for determining \pm1 (small variance would indicate a noise signal). Good feature construction is a very important part of learning from data successfully.
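A minimal sketch of this feature-construction idea; the image sizes, the toy signals, and the normalization below are my own illustrative choices:

[code]
import numpy as np

def spectrum_variance(img):
    # Variance of the normalized 2D Fourier magnitude spectrum.
    # White noise spreads its energy roughly evenly over frequencies (low variance);
    # a structured image concentrates energy in a few frequencies (high variance).
    spectrum = np.abs(np.fft.fft2(img))
    spectrum = spectrum / spectrum.sum()
    return spectrum.var()

rng = np.random.default_rng()
noise = rng.normal(size=(64, 64))                               # a pure-noise "image"
cols = np.arange(64)
structured = np.tile(np.sin(2 * np.pi * cols / 8.0), (64, 1))   # a toy image with structure

print(spectrum_variance(noise), spectrum_variance(structured))
# the structured image scores far higher, so thresholding this single
# feature separates "noisy" from "not-noisy"
[/code]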



Quote:

Originally Posted by cygnids (Post 8722)
Thank you for your response and explanation. Mea culpa. I was thinking a bit too generally, perhaps loosely too, when I should have thought through an example such as the one you note.

In passing, let me restate what I had in mind in my earlier post. To begin with, I shouldn't have used the word frequency; rather, I should have said wavelength, since I used images as my example. I thought: if I had a perfectly noisy set of images as input, and my end goal was to classify noisy vs. not-noisy images, I would have a case where I attempt to learn the problem after a 2d Fourier transform of the images. For the sake of discussion, in the transformed space we'd see a flattish surface corresponding to the various spatial wavelengths (i.e., equal energy in all wavelengths, since it is a Fourier-transformed Gaussian), and basically a constant/flat function (glossing over a lot of practical points, such as tapering to keep it band-limited). I would use this smooth *characterization of noise* as an input to the learning algorithm, and eventually classify new images. I.e., if in the wavelength space the transformed image is flattish => it is a "noisy" image, else => it is a "not-noisy" image.

In my mind, I started out with a target function which was completely random, i.e., noisy, and I used the Fourier-transformed space to turn it into a near-constant (smooth) function which I could use for classification. And presumably, one managed to learn using a perfectly noisy image set. This thinking prompted my question of why random functions can't be learned. Needless to add, I did not map my thinking properly enough; the target function for this binary classification isn't the 2d Fourier transform, and further, I also mapped the input space (i.e., image pixel location) onto a wavelength space.

I like your simple example. Thank you.


cygnids 01-17-2013 12:52 AM

Re: Exercises and Problems
 
Yes, I do see your point that in my example, even though the input images may be completely noisy, the target function in my construct isn't random. Thank you for your explanations. I do appreciate it.

My apologies for distorting this thread by my inquiry!

srport@pacbell.net 01-26-2013 11:04 PM

Re: Exercises and Problems
 
Question about the pocket algorithm described on p. 80 and illustrated on p. 83.

For data that is not linearly separable, I originally thought the pocket algorithm would run exactly like the PLA and simply take the best-performing w vector over the entire iteration range and report it back. However, the description on p. 80 says that the w vector is not updated if Ein is worse.

Suppose that after a certain number of iterations the set of misclassified points is such that every misclassified point used to perform an update returns a worse-performing w vector. In this case, the algorithm will never be able to find a better w vector.

My experience watching the PLA as it iterates is that it seems to jump about and eventually (randomly) converges (with linearly separable data). I've seen proofs on the internet that it will converge, but do not really understand them. My point is that each iteration of the PLA does not necessarily improve g(x), but that in order to reach the best g(x), this jumping about is part of the process.

Can you please comment?

yaser 01-26-2013 11:56 PM

Re: Exercises and Problems
 
Quote:

Originally Posted by srport@pacbell.net (Post 9019)
Question about the pocket algorithm described on p. 80 and illustrated on p. 83.

For data that is not linearly separable, I originally thought the pocket algorithm would run exactly like the PLA and simply take the best-performing w vector over the entire iteration range and report it back. However, the description on p. 80 says that the w vector is not updated if Ein is worse.

Suppose that after a certain number of iterations the set of misclassified points is such that every misclassified point used to perform an update returns a worse-performing w vector. In this case, the algorithm will never be able to find a better w vector.

My experience watching the PLA as it iterates is that it seems to jump about and eventually (randomly) converges (with linearly separable data). I've seen proofs on the internet that it will converge, but do not really understand them. My point is that each iteration of the PLA does not necessarily improve g(x), but that in order to reach the best g(x), this jumping about is part of the process.

Can you please comment?

Hi,

There are two {\bf w} vectors that the pocket algorithm keeps track of. The "pocket weight vector" \hat{\bf w} is the one that is not updated if E_{\rm in} gets worse. However, the other weight vector {\bf w}(t), which is identical to its counterpart in PLA, keeps getting updated with every iteration using a misclassified point. The algorithm then reports \hat{\bf w} as its final solution.
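A minimal sketch of that bookkeeping (assuming X carries a bias coordinate and y holds the \pm1 labels; the names are mine):

[code]
import numpy as np

def pocket(X, y, max_iters=1000):
    w = np.zeros(X.shape[1])                  # w(t): updated every iteration, as in PLA
    w_hat = w.copy()                          # the pocket vector \hat{w}
    best_err = np.mean(np.sign(X @ w) != y)   # its in-sample error
    for _ in range(max_iters):
        mis = np.nonzero(np.sign(X @ w) != y)[0]
        if len(mis) == 0:
            return w                          # the data happened to be separable
        w = w + y[mis[0]] * X[mis[0]]         # PLA update, applied unconditionally
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:                    # only \hat{w} is protected from bad updates
            best_err, w_hat = err, w.copy()
    return w_hat                              # report the pocket vector, not w(t)
[/code]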

mariovela 04-15-2013 09:32 AM

Re: Exercise 1.10
 
On Exercise 1.10(c), I'm getting confused about how to plot P[|\nu-\mu|>\epsilon] as a function of \epsilon... I assumed the distribution is binomial, right?
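Yes: for a fair coin, the number of heads in N=10 flips is Binomial(10, 0.5), so P[|\nu-\mu|>\epsilon] can be computed exactly (or estimated from the simulation counts) and plotted against the Hoeffding bound 2e^{-2\epsilon^2 N}. A sketch under that assumption, using scipy:

[code]
import numpy as np
from scipy.stats import binom

N, mu = 10, 0.5
k = np.arange(N + 1)
nu = k / N
pmf = binom.pmf(k, N, mu)                    # exact distribution of nu

eps = np.linspace(0.0, 0.5, 11)
p_exceed = np.array([pmf[np.abs(nu - mu) > e].sum() for e in eps])
bound = 2 * np.exp(-2 * eps**2 * N)          # the Hoeffding bound

for e, p, b in zip(eps, p_exceed, bound):    # these are the two curves to plot vs. eps
    print(f"eps={e:.2f}  P[|nu-mu|>eps]={p:.4f}  bound={b:.4f}")
[/code]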

giridhar1202 09-09-2015 11:17 AM

Re: Exercises and Problems
 
I conducted the experiment mentioned in 1.10(b), and I got the following result.

V1 = 90, 999, 4375, 11803, 20329, 24793, 20411, 11685, 4450, 988, 77
Vrand = 103, 1022, 4389, 11691, 20444, 24489, 20653, 11669, 4502, 941, 97
Vmin = 62376, 37622, 2, 0, 0, 0, 0, 0, 0, 0, 0

i.e., 90 times out of 100,000, I get 0 heads in 10 tosses of the first coin;
999 times out of 100,000, I get 1 head in 10 tosses of the first coin;
....
77 times out of 100,000, I get 10 heads in 10 tosses of the first coin;

and

103 times out of 100,000, I get 0 heads in 10 tosses of the coin chosen at random;
....
97 times out of 100,000, I get 10 heads in 10 tosses of the coin chosen at random;

and

62376 times out of 100,000, I get 0 heads in 10 tosses of the coin for which the number of heads was minimum across the 1000 coins.

So, as expected, the distributions of V1 and Vrand are similar.
Can someone please explain how we should interpret the result for Vmin?

Is the distribution of Vmin suggesting that one should be careful about overfitting? Because if we have many hypotheses, there will almost always be some hypothesis which fits the data set exactly. So what should be one's strategy for selecting a hypothesis?

magdon 09-21-2015 04:33 PM

Re: Exercises and Problems
 
Your connection to learning is correct: when there are many hypotheses, you should be more careful about overfitting. But the main point of the exercise is to realize that if you pick a coin carefully based on the data of the flips (in this case "carefully" means having minimum nu), then the distribution of the nu you see will not be what you would expect if you tossed that *same* coin again. If you tossed that c_min again, you would expect to see a binomial distribution for heads. But the distribution of nu_min is clearly not binomial.

Continuing to learning: if you pick a hypothesis "carefully", say having minimum Ein, then the Ein you get will not necessarily reflect the distribution of errors you get when you "toss that hypothesis again" -- i.e., test it on new data.

Quote:

Originally Posted by giridhar1202 (Post 12031)
I conducted the experiment mentioned in 1.10(b), and I got the following result.

V1 = 90, 999, 4375, 11803, 20329, 24793, 20411, 11685, 4450, 988, 77
Vrand = 103, 1022, 4389, 11691, 20444, 24489, 20653, 11669, 4502, 941, 97
Vmin = 62376, 37622, 2, 0, 0, 0, 0, 0, 0, 0, 0

i.e., 90 times out of 100,000, I get 0 heads in 10 tosses of the first coin;
999 times out of 100,000, I get 1 head in 10 tosses of the first coin;
....
77 times out of 100,000, I get 10 heads in 10 tosses of the first coin;

and

103 times out of 100,000, I get 0 heads in 10 tosses of the coin chosen at random;
....
97 times out of 100,000, I get 10 heads in 10 tosses of the coin chosen at random;

and

62376 times out of 100,000, I get 0 heads in 10 tosses of the coin for which the number of heads was minimum across the 1000 coins.

So, as expected, the distributions of V1 and Vrand are similar.
Can someone please explain how we should interpret the result for Vmin?

Is the distribution of Vmin suggesting that one should be careful about overfitting? Because if we have many hypotheses, there will almost always be some hypothesis which fits the data set exactly. So what should be one's strategy for selecting a hypothesis?


williamshakespare 05-01-2016 08:42 AM

Re: Exercises and Problems
 
Quote:

Originally Posted by tadworthington (Post 3335)
Problem 1.4

WHAT I LIKED:
This is an excellent problem, and a model one, in my opinion, for helping a student understand the material. It starts out easy, with a small data set that, importantly, is user-generated. Having to generate a data set - even one as simple as a linearly separable data set in 2 dimensions - goes a long way toward understanding how the Perceptron works and why it wouldn't work if the data were not linearly separable. We gradually progress to bigger data sets, plotting the data along the way so that we can see what is happening. It is not only instructive but also quite exhilarating to see the results of the PLA (plotting both the target function and the hypothesis generated by the PLA) actually graphed out in front of you on a computer screen!

I also thought that the progression to 10 dimensions only after working through the more easily visualized 2-dimensional examples is perfect. Finally, the histogram approach to understanding the computational complexity of the PLA was, I thought, genius.

TIME SPENT:
Overall I spent about 90 minutes on the problem, although a lot of that was spent documenting my code for future reference and general "prettying" of my plots and results; so I would guess that this problem can be completed comfortably in an hour, assuming knowledge of programming in a language where plotting is easy (I used Python, but Matlab/Octave would also be excellent choices for quick-and-dirty graphics programming of this type). For someone with no programming experience, the problem would probably take much more time.

CLARITY:
I think the problem is exceptionally clear.

DIFFICULTY:
I thought the problem was relatively easy, but the Perceptron is a relatively easy concept so I think making it harder is neither necessary nor appropriate. For metrics purposes, on a 1-10 scale of difficulty I would give this a 3.


Would you like to share your answer, please? Thanks

williamshakespare 05-13-2016 08:34 PM

Re: Exercises and Problems
 
Quote:

Originally Posted by htlin (Post 3338)
Thank you so much for the valuable feedback!

Mr. Lin, I am studying the Python language in order to complete Problem 1.4.
Would you please share the answer (code in Python) to provide an example for me?
Thanks.

MaxvonHippel 03-12-2018 02:08 PM

Re: Exercises and Problems - 1.3
 
For Exercise 1.3 we are asked to argue that the move from w(t) to w(t+1) is a move "in the right direction". I think I may be misunderstanding the question and/or Figure 1.3. My impression is that the figure shows us the case of R^2 (i.e., d=2), but that for arbitrary R^d we are considering the case where we update w(t) to w(t+1) and then argue that the resulting change in the location of the boundary plane is a move toward (x, y), and is therefore more likely than not to pass by (x, y), thereby putting (x, y) on the opposite (and correct) side of the classification boundary.

This is easy enough to say in English, as I just did. But unless I'm missing something, I don't think the analytic proof is necessarily so straightforward. It seems to me that I'd have to show that the plane moves closer to the point (x, y). I think then I would have to argue that moving closer implies increasing the likelihood of passing by/through the point. (This part seems straightforward.) And then I would argue that increasing the likelihood of passing by/through the point is logically equivalent to increasing the likelihood of correctly classifying the point.

Is my overarching logic here sound? And is a mathematical (specifically, analytic) proof of this argument what you are intending for an answer? Or is the intent just for the reader to formulate an English explanation such as that which I attempt to give above?

Thank you very much for the help! I am really enjoying your book.
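For what it's worth, the analytic core of the argument is a one-line computation rather than a statement about distances: a point is misclassified exactly when y(t)\,{\bf w}^{\rm T}(t){\bf x}(t) < 0, and the update {\bf w}(t+1) = {\bf w}(t) + y(t){\bf x}(t) gives

y(t)\,{\bf w}^{\rm T}(t+1){\bf x}(t) = y(t)\,{\bf w}^{\rm T}(t){\bf x}(t) + \|{\bf x}(t)\|^2 > y(t)\,{\bf w}^{\rm T}(t){\bf x}(t),

since y(t)^2 = 1 (assuming {\bf x}(t) \neq {\bf 0}). The quantity whose sign decides correct classification strictly increases, which is the precise sense of "a move in the right direction"; no likelihood argument is needed.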

htlin 03-17-2018 04:33 PM

Re: Exercises and Problems
 
Hint: the answer to (c) is meant to be formed from the answer to (b). Hope this helps.

MaxvonHippel 03-18-2018 05:20 PM

Re: Exercises and Problems
 
Ah, that makes sense. (I was vastly overthinking this.)
Thank you!

