Exercises and Problems
Please comment on the chapter problems in terms of difficulty, clarity, and time demands. This information will help us and other instructors in choosing problems to assign in our classes.
Also, please comment on the exercises in terms of how useful they are in understanding the material. 
Exercise 1.10
Since this is a theory week, thought this might be a good time to explore Hoeffding a bit.
I understand c_1 and c_rand satisfy Hoeffding experimentally as described, but conceptually does c_rand satisfy Hoeffding? For example, suppose it is unknown whether each coin is fair (or that they are known to have varying fairness  e.g. c_1 is 50/50, c_2 is 40/60, etc.). Would each coin represent a separate 'bin' or would the random selection of a coin plus the ten flips represent the randomized selection condition for c_rand? Trying to understand if it's necessary for the coins to be identical. 
Re: Exercise 1.10
Quote:

Re: Exercises and Problems
Problem 1.4
WHAT I LIKED: This is an excellent problem, and a model one in my opinion for helping a student to understand material. It starts out easy, with a small data set  that, importantly, is usergenerated. Having to generate a data set  even one as simple as a linearly separable data set in 2 dimensions  goes a long way to helping understand how the Perceptron works and why it wouldn't work if the data were not linearly separable. We gradually progress to bigger data sets, all along the way plotting the data so that we can see what is happening. It is not only instructive but also quite exhilarating to see the results of the PLA (plotting both the target function and the hypothesis generated by the PLA) actually graphed out in front of you on a computer screen! I also thought that the progression to 10 dimensions only after working through the more easily visualized 2 dimensional examples is perfect. Finally, the histogram approach to understanding the computational complexity of the PLA was, I thought, genius. TIME SPENT: Overall I spent about 90 minutes on the problem, although a lot of that was spent documenting my code for future reference, and just general "prettying" of my plots and results; so I would guess that this problem can be completed comfortably in an hour, assuming knowledge of programming in a language where plotting is easy (I used Python, but Matlab/Octave would also be excellent examples for quickanddirty graphics programming of this type.) For someone with no programming experience, the problem would probably take much more time. CLARITY: I think the problem is exceptionally clear. DIFFICULTY: I thought the problem was relatively easy, but the Perceptron is a relatively easy concept so I think making it harder is neither necessary nor appropriate. For metrics purposes, on a 110 scale of difficulty I would give this a 3. 
Re: Exercises and Problems
Quote:

Re: Exercise 1.10
While comparing the CoinFlip experiment with the BinModel
a) Sample size of the individual Bin is equal to the number of coin flips (N=10) b) Number of hypothesis is equal to number of coins (M=1000) but, the aspect of repeating the entire coinflip experiment large number of times (100,000) does not have any counterpart with the (Single or Multiple) BinModel. Is the purpose of repeating the coinflip experiment only to find the estimates of P[vmu>epsilon] without using the hoeffding inequality and then comparing it with the bound given by hoeffding ? Kindly clarify this point. Thanks. 
Problem 1.10 : Expected Off Training Error
Hi,
If I got it right, in a noiseless setting, for a fixed D, if all f are equally likely, the expected offtrainingerror of any hypothesis h is 0.5 (part d of problem 1.10, page 37) and hence any two algorithms are the same in terms of expected off training error (part e of the same problem). My question is, does this not contradict the generalization by Hoeffding. Specifically, the following point is bothering me By Hoeffding : Ein approaches Eout for larger number of hypothesis (i.e for small epsilon) as N grows sufficiently large. Which would imply that expected(Eout) should be approximately the same as expected(Ein) and not a constant (0.5). Can you please provide some insight on this, perhaps my comparison is erroneous. Thanks. 
Re: Problem 1.10 : Expected Off Training Error
Quote:
On face value, the statement "all are equally likely" sounds reasonable. Mathematically, it corresponds to trying to learn a randomly generated target function, and getting the average performance of this learning process. It should not be a surprise that we would get 0.5 under these circumstances, since almost all random functions are impossible to learn. In terms of and , Hoeffding inequality certainly holds for each of these random target functions, but itself will be close to 0.5 for almost all of these functions since they have no pattern to fit, so Hoeffding would indeed predict to be close to 0.5 on average. This is why learning was decomposed into two separate questions in this chapter. In terms of these two questions, the one that "fails" in the random function approach is "?" Let me finally comment that treating "all are equally likely" as a plausible statement is a common trap. This issue is addressed in detail in the context of Bayesian learning in the following video segment: http://work.caltech.edu/library/182.html 
Re: Problem 1.10 : Expected Off Training Error
My takeaway point : All f are equally likely corresponds to trying to learn a randomly generated target function
Thanks for the detailed explanation. The Bayesian Learning example highlights the ramifications of this assumption, very useful point indeed. 
Re: Problem 1.10 : Expected Off Training Error
Hoeffding applies to a single hypothesis . To take a concrete case of the setting in the problem, suppose that is obtained by flipping a fair coin for every data point to obtain . The problem proves that your expected off training set error is 0.5; no surprise, since is random
Hoeffding tells you that your training error (if N is large enough) will not deviate far from Eout with very high probability. This means that with very high probability, Ein will be close to 0.5. Problem 1.10 says that if you are in such a pathalogical learning situation with a random , then no matter what you do insample, your outofsample performance will be very bad. (This is sometimes called nofreelunch, because you cannot expect to succeed unless you assume that f is not completely random). Hoeffding (and VC) says that provided your hypothesis set is not too large compared to N, you will know that you are in a bad situation because your Ein will reflect Eout and be close to 0.5. The nature of your with respect to will determine how good your Eout; if is very bad, then Eout will be close to 0.5. Hoeffding role is to ensure that Ein will tell you that you are in this tough situation. To summarize: problem 1.10 identifies one case in which you are in trouble and Eout will be 0.5. Hoeffeing tells you when you will be able to know that you are in trouble. Quote:

Re: Problem 1.10 : Expected Off Training Error
Thank you

Exercise 1.5
The exercise is to determine whether each problem is more suited for the learning or design approach.
Part a "Determining the age at which a particular medical test is recommended" seems like an information retrieval problem not a learning or design problem. Doctors have already decided when to recommend a test. Perhaps it should read: "Determining the age at which a particular medical test should be recommended or performed" to make it a learning problem. 
Re: Exercise 1.5
Yes, the use of "is" makes it seem like the answer is known and just needs to be retrieved; "should be" would have been a better choice of words, thanks.
Quote:

Re: Exercises and Problems
With respect to the discussion on Prob 1.10 thread here, it is noted that learning is not possible when the target function is a random function. On that note, what crosses my mind is how does one reconcile that statement with the thought that a purely random function could have more characterizable features after transformation to other domains? For eg., white noise maps to constant in spectral domain. Perhaps naively, if I had the a set of images with a reasonable proportion of fully noisy images, and choose to apply the ML apparatus in the frequency domain, I could learn a system to classify noisy vs notnoisy? Right?
It's very likely I'm picking on a point out of context here? 
Re: Exercises and Problems
I suppose I'm interpreting "noise" rather loosely? When I said white noise maps to constant in the frequency domain, I was thinking Gaussian noise, and in which case, we have a case where the output can be characterized as a PDF, and hence not quite "noise" since it can be probabilistically characterized? :clueless:

Re: Exercises and Problems
It is true that white noise in the time domain is uniform in the frequency domain. But I do not quite follow how that relates to the prediction task, which in your example can be formulated as follows:
A target function f(x) is white noise. You are given f(x_n) on some data points x_n and asked to learn how to predict f on a test data point x_test. Your best prediction is 0 and your error will be distributed according to a Gaussian, no matter what your training data were, which is by definition of white noise. This would be your prediction even before you saw any training data, so you have not improved your prediction (i.e. you have not been able to learn) from the training data. What bounds like Hoeffding tell you is that you will find out from your training data that your error will be large. The same happens in classification where the function is a random . The data does not help you to predict outside the data, but at least you will learn from your data that you cannot predict outside the data. It is better than learning nothing :). Quote:

Re: Exercises and Problems
Thank you for your response and explanation. Mea culpa. I was thinking a bit too generally, perhaps loosely too, when I should have thought through an example such as the one you note.
In passing, let me restate what I had in mind in my earlier post. To begin with, I shouldn't have used the word frequency, but rather, I should have said wavelength since I used images as my example. I thought, if I had perfectly noisy set of images as input, and my end goal was to classify noisy vs notnoisy images, I would have a case where I attempt to learn the problem after a 2d Fourier transform of the images. For sake of discussion, in the transformed space, we'd see a flatish surface corresponding to various (spatial wavelengths, ie equal energy in all wavelengths since it is a Fourier transformed Gaussian), and basically a constant/flat function (glossing over a lot of practical points such as tapering to keep it bandlimited). I would use this smooth *characterization of noise* as an input to the learning algorithm, and eventually classify new images. Ie, if in the wavelength space the transformed image is flattish => it is "noisy" image, else => it is a "notnoisy" image. In my mind, I started out with a target function which was completely random, ie noisy, and I used the Fourier transformed space to turn it into a nearconstant (smooth) function which I could use for classification. And presumably, I one managed to learn using a perfectly noisy image set. This thinking prompted my question, which questioned why random functions can't be learned. Needless adding, I did not map my thinking properly enough; the target function for this binary classification isn't the 2d Fourier transform; and further, I also mapped the input space (ie image pixel location) onto a wavelength space. I like your simple example. Thank you. 
Re: Exercises and Problems
Good example top clarify an important issue. In your example, the target function is not random. It assigns to a signal that is white noise a 1 and to a signal that is not white noise (say, containing a real image) a +1. A random target would assign random +1,1 to the signal.
What you have described is what we call feature construction and you will study it later in the book. Specifically, given the input signal construct the fourrier transform and take (for example) the variance of this fourier signal. This variance is a good feature of the input for determining (small variance would indicate a noise signal). Good feature construction is a very important part of learning from data successfully. Quote:

Re: Exercises and Problems
Yes, I do see your point that in my example even though the input images may be completely noisy, the target function in my construct isn't random. Thank you for your explanations. I do appreciate it.
My apologies for distorting this thread by my inquiry! 
Re: Exercises and Problems
Question about the pocket algorithm described on p. 80 and illustrated on p. 83.
For data that is not linearly separable, I originally thought the pocket algorithm would run exactly like the PLA and simply take the best performing w vector over the entire iteration range and report it back. However, the description on p. 80 says that the w vector is not updated if Ein is worse. Suppose after a certain number of iterations that the set of misclassified points is such that every misclassified point used to perform an update returns a worse performing w vector. In this case, the algorithm will never be able to find a better w vector. My experience watching the PLA as it iterates is that it seems to jump about, and eventually (randomly) converges (with linearly separable data). I've seen proofs on the internet that it will converge, but do not really understand them. My point is that each iteration of the PLA does not necessarily improve g(x), but that in order to reach the best g(x), this jumping about is part of the process. Can you please comment? 
Re: Exercises and Problems
Quote:
There are two vectors that the pocket algorithm keeps track of. The "pocket weight vector" is the one that is not updated if gets worse. However, the other weight vector , which is identical to its counterpart in PLA, keeps getting updated with every iteration using a misclassified point. The algorithm then reports as its final solution. 
Re: Exercise 1.10
On exercise 1.10 (c).. I'm getting confused on how to plot P[vu>e] as a function of e... I assumed the distribution is binomial right?

Re: Exercises and Problems
I conducted experiment mentioned in 1.10(b), and i got following result.
V1 = 90, 999, 4375, 11803, 20329, 24793, 20411, 11685, 4450, 988, 77 Vrand = 103, 1022, 4389, 11691, 20444, 24489, 20653, 11669, 4502, 941, 97 Vmin = 62376, 37622, 2, 0, 0, 0, 0, 0, 0, 0, 0 i.e; 90 times out of 100,000 times, i get 0 heads in 10 tosses of first coin 999 times out of 100,000 times, i get 1 head in 10 tosses of first coin .... 77 times out of 100,000 times, i get 10 heads in 10 tosses of first coin and 103 times out of 100,000 times, i get 0 heads in 10 tosses of coin chosen at random .... 97 times out of 100,000 times, i get 10 heads in 10 tosses of coin chosen at random and 62376 times out of 100,000 times, i get 0 heads in 10 tosses of coin for which number of heads was minimum across 1000 coins So, it is as expected that distribution of V1 and Vrand are similar. Can someone please explain how we should interpret result for Vmin? Is distribution of Vmin suggesting that one should be careful about overfitting ? Because if we have many hypothesis there will almost always be some hypothesis which fits data set exactly. So what should be one's strategy for selecting hypothesis ? 
Re: Exercises and Problems
Your connection to learning is correct, that when there many hypotheses, you should be more careful about overfitting. But the main point of the exercise is to realize that if you pick a coin carefully based on the data of the flips (in this case "carefully" means having minimum nu), then the distribution of the nu you see will not be what you would expect if you tossed that *same* coin again. If you tossed that c_min again, you expect to see a binomial distribution for heads. But the distribution of nu_min is clearly not binomial.
Continuing to learning, if you pick a hypothesis "carefully", say having minimum Ein, then the Ein you get will not necessarily reflect the distribution of errors you get when you "toss that hypothesis again"  i.e test it on new data. Quote:

Re: Exercises and Problems
Quote:
would you like to share your answer please? Thanks 
Re: Exercises and Problems
Quote:
Would you please share the answer(code with python) to provide an example for me? Thanks. 
Re: Exercises and Problems  1.3
For question 1.3 we are asked to argue that the move from w(t) to w(t+1) is a move "in the right direction". I think I may be misunderstanding the question and/or the figure 1.3. My impression is that the figure shows us the case of R2 (ie, d=1), but that for arbitrary Rd we are considering the case where we change w(t+1) and then argue that the resulting change in the location of the boundary plane is a move toward (x, y) and therefore more likely than not to pass by (x, y), thereby putting (x, y) on the opposite (and correct) side of the classification boundary.
This is easy enough to say in English, as I just did. But unless I'm missing something, I don't think the analytic proof is necessarily so straightforward. It seems to me that I'd have to show that the plane moves closer to the point (x, y). I think then I would have to argue that moving closer implies increasing the likelihood of passing by/through the point. (This part seems straightforward). And then I would argue that increasing the likelihood of passing by/through the point is logically equivalent to increasing the likelihood of correctly classifying the point. Is my overarching logic here sound? And is a mathematical (specifically, analytic) proof of this argument what you are intending for an answer? Or is the intent just for the reader to formulate an English explanation such as that which I attempt to give above? Thank you very much for the help! I am really enjoying your book. 
Re: Exercises and Problems
Hint: the answer to (c) is meant to be formed from the answer to (b). Hope this helps.

Re: Exercises and Problems
Ah, that makes sense. (I was vastly overthinking this.)
Thank you! 
Re: Exercise 1.10
Quote:
The definition and subsequent analysis makes the assumption that there is a single target function f which we are trying to approximate. Clearly, if there are multiple target functions then we don't satisfy the initial assumptions for using the learning algorithm. 
All times are GMT 7. The time now is 04:17 AM. 
Powered by vBulletin® Version 3.8.3
Copyright ©2000  2022, Jelsoft Enterprises Ltd.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. AbuMostafa, Malik MagdonIsmail, and HsuanTien Lin, and participants in the Learning From Data MOOC by Yaser S. AbuMostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.