LFD Book Forum  

Go Back   LFD Book Forum > Book Feedback - Learning From Data > Chapter 1 - The Learning Problem

  #21  
Old 01-26-2013, 10:04 PM
srport@pacbell.net
Junior Member
 
Join Date: Jan 2013
Posts: 1
Default Re: Exercises and Problems

Question about the pocket algorithm described on p. 80 and illustrated on p. 83.

For data that is not linearly separable, I originally thought the pocket algorithm would run exactly like the PLA and simply take the best performing w vector over the entire iteration range and report it back. However, the description on p. 80 says that the w vector is not updated if Ein is worse.

Suppose that after a certain number of iterations, the set of misclassified points is such that updating on any misclassified point returns a worse-performing w vector. In this case, the algorithm will never be able to find a better w vector.

My experience watching the PLA as it iterates is that it seems to jump about, and eventually (randomly) converges (with linearly separable data). I've seen proofs on the internet that it will converge, but do not really understand them. My point is that each iteration of the PLA does not necessarily improve g(x), but that in order to reach the best g(x), this jumping about is part of the process.

Can you please comment?
  #22  
Old 01-26-2013, 10:56 PM
yaser
Caltech
 
Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,472
Default Re: Exercises and Problems

Quote:
Originally Posted by srport@pacbell.net View Post
[question about the pocket algorithm, quoted in full in post #21 above]
Hi,

There are two w vectors that the pocket algorithm keeps track of. The "pocket weight vector" ŵ is the one that is not updated if E_in gets worse. However, the other weight vector w(t), which is identical to its counterpart in PLA, keeps getting updated with every iteration using a misclassified point. The algorithm then reports ŵ as its final solution.
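A minimal sketch of this two-vector bookkeeping in Python/NumPy (the data arrays X, y, the iteration budget, and all names are illustrative assumptions, not the book's code):

```python
import numpy as np

def pocket(X, y, max_iter=1000, seed=0):
    """Pocket algorithm: run ordinary PLA updates on w(t), while keeping
    the best weight vector seen so far ("in the pocket") as measured by
    the in-sample error E_in."""
    rng = np.random.default_rng(seed)

    def e_in(wv):
        return np.mean(np.sign(X @ wv) != y)

    w = np.zeros(X.shape[1])   # w(t): updated at every iteration, like PLA
    w_hat = w.copy()           # pocket vector: replaced only when E_in improves
    best = e_in(w_hat)
    for _ in range(max_iter):
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:      # separable data: w is already perfect
            return w
        i = rng.choice(mis)    # pick any misclassified point
        w = w + y[i] * X[i]    # PLA update; E_in may well get worse here
        if e_in(w) < best:     # pocket update only on improvement
            best, w_hat = e_in(w), w.copy()
    return w_hat               # report the pocket vector, not w(t)
```

On separable data this behaves like PLA; on non-separable data the E_in of ŵ never increases even though w(t) keeps jumping around.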
__________________
Where everyone thinks alike, no one thinks very much
  #23  
Old 04-15-2013, 08:32 AM
mariovela
Junior Member
 
Join Date: Apr 2012
Posts: 5
Default Re: Exercise 1.10

On Exercise 1.10(c), I'm getting confused about how to plot P[|ν − μ| > ε] as a function of ε. I assumed the distribution is binomial, right?
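Right: for a single fair coin, the number of heads in N = 10 tosses is Binomial(10, 0.5), so P[|ν − μ| > ε] is a step function of ε that can be tabulated exactly and plotted against the Hoeffding bound 2·exp(−2ε²N). A sketch of the computation (names are illustrative):

```python
import math

N, mu = 10, 0.5  # 10 tosses of a fair coin

def p_binom(k):
    # P[exactly k heads in N fair flips]
    return math.comb(N, k) * 0.5 ** N

def p_deviation(eps):
    # exact P[|nu - mu| > eps], where nu = (#heads)/N
    return sum(p_binom(k) for k in range(N + 1)
               if abs(k / N - mu) > eps)

def hoeffding(eps):
    # Hoeffding bound for one coin
    return 2 * math.exp(-2 * eps ** 2 * N)

for eps in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"eps={eps:.1f}  P={p_deviation(eps):.4f}  bound={hoeffding(eps):.4f}")
```

Plotting p_deviation and hoeffding on the same axes (e.g. with matplotlib) gives the comparison the exercise asks for.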
  #24  
Old 09-09-2015, 10:17 AM
giridhar1202
Junior Member
 
Join Date: Sep 2015
Posts: 2
Default Re: Exercises and Problems

I conducted the experiment mentioned in Exercise 1.10(b), and I got the following results.

ν_1 = 90, 999, 4375, 11803, 20329, 24793, 20411, 11685, 4450, 988, 77
ν_rand = 103, 1022, 4389, 11691, 20444, 24489, 20653, 11669, 4502, 941, 97
ν_min = 62376, 37622, 2, 0, 0, 0, 0, 0, 0, 0, 0

That is: 90 times out of 100,000, I get 0 heads in 10 tosses of the first coin;
999 times out of 100,000, I get 1 head in 10 tosses of the first coin;
....
77 times out of 100,000, I get 10 heads in 10 tosses of the first coin.

Similarly:

103 times out of 100,000, I get 0 heads in 10 tosses of the coin chosen at random;
....
97 times out of 100,000, I get 10 heads in 10 tosses of the coin chosen at random.

And:

62,376 times out of 100,000, I get 0 heads in 10 tosses of the coin with the minimum number of heads across the 1,000 coins.

So, as expected, the distributions of ν_1 and ν_rand are similar.
Can someone please explain how we should interpret the result for ν_min?

Is the distribution of ν_min suggesting that one should be careful about overfitting? Because if we have many hypotheses, there will almost always be some hypothesis that fits the data set exactly. So what should one's strategy be for selecting a hypothesis?
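For reference, one way this experiment can be simulated (a sketch; the number of runs is reduced from 100,000 to keep it fast, and all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_flips, n_runs = 1000, 10, 2000

counts_1 = np.zeros(n_flips + 1, dtype=int)     # heads histogram for c_1
counts_rand = np.zeros(n_flips + 1, dtype=int)  # ... for a randomly chosen coin
counts_min = np.zeros(n_flips + 1, dtype=int)   # ... for c_min

for _ in range(n_runs):
    # flip all 1000 fair coins 10 times each
    heads = rng.binomial(n_flips, 0.5, size=n_coins)
    counts_1[heads[0]] += 1                      # first coin
    counts_rand[heads[rng.integers(n_coins)]] += 1  # random coin
    counts_min[heads.min()] += 1                 # coin with fewest heads

print("nu_1:   ", counts_1)
print("nu_rand:", counts_rand)
print("nu_min: ", counts_min)
```

The first two histograms come out binomial-shaped, while nearly all of the ν_min mass sits at 0 or 1 heads, matching the results quoted above.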
  #25  
Old 09-21-2015, 03:33 PM
magdon
RPI
 
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 592
Default Re: Exercises and Problems

Your connection to learning is correct: when there are many hypotheses, you should be more careful about overfitting. But the main point of the exercise is to realize that if you pick a coin carefully based on the data of the flips (in this case, "carefully" means having minimum ν), then the distribution of the ν you see will not be what you would expect if you tossed that *same* coin again. If you tossed that c_min again, you would expect to see a binomial distribution for the number of heads. But the distribution of ν_min is clearly not binomial.

Carrying this over to learning: if you pick a hypothesis "carefully", say the one with minimum E_in, then the E_in you get will not necessarily reflect the distribution of errors you get when you "toss that hypothesis again", that is, test it on new data.
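The ν_min column can even be checked in closed form: ν_min = 0 exactly when at least one of the 1,000 fair coins comes up all tails in its 10 flips, which gives a probability close to the 62,376 out of 100,000 observed above:

```python
# P[nu_min = 0] = P[at least one of 1000 coins shows 0 heads in 10 flips]
p_zero_one_coin = 0.5 ** 10                        # P[0 heads for one coin]
p_nu_min_zero = 1 - (1 - p_zero_one_coin) ** 1000  # complement of "no coin at 0"
print(round(p_nu_min_zero, 4))                     # about 0.62
```

A binomial ν for a single fair coin would put only 0.5^10 ≈ 0.001 of its mass at 0 heads, so the selection of c_min changes the distribution drastically.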

Quote:
Originally Posted by giridhar1202 View Post
[experiment results for Exercise 1.10(b), quoted in full in post #24 above]
__________________
Have faith in probability
  #26  
Old 05-01-2016, 07:42 AM
williamshakespare
Junior Member
 
Join Date: Mar 2016
Posts: 6
Default Re: Exercises and Problems

Quote:
Originally Posted by tadworthington View Post
Problem 1.4

WHAT I LIKED:
This is an excellent problem, and a model one in my opinion for helping a student to understand material. It starts out easy, with a small data set - that, importantly, is user-generated. Having to generate a data set - even one as simple as a linearly separable data set in 2 dimensions - goes a long way to helping understand how the Perceptron works and why it wouldn't work if the data were not linearly separable. We gradually progress to bigger data sets, all along the way plotting the data so that we can see what is happening. It is not only instructive but also quite exhilarating to see the results of the PLA (plotting both the target function and the hypothesis generated by the PLA) actually graphed out in front of you on a computer screen!

I also thought that the progression to 10 dimensions only after working through the more easily visualized two-dimensional examples is perfect. Finally, the histogram approach to understanding the computational complexity of the PLA was, I thought, genius.

TIME SPENT:
Overall I spent about 90 minutes on the problem, although a lot of that was spent documenting my code for future reference, and just general "prettying" of my plots and results; so I would guess that this problem can be completed comfortably in an hour, assuming knowledge of programming in a language where plotting is easy (I used Python, but Matlab/Octave would also be excellent examples for quick-and-dirty graphics programming of this type.) For someone with no programming experience, the problem would probably take much more time.

CLARITY:
I think the problem is exceptionally clear.

DIFFICULTY:
I thought the problem was relatively easy, but the Perceptron is a relatively easy concept so I think making it harder is neither necessary nor appropriate. For metrics purposes, on a 1-10 scale of difficulty I would give this a 3.

Would you please share your answer? Thanks.
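Not the book's solution, but a minimal Python/NumPy sketch of the kind of PLA experiment Problem 1.4 describes (the data generation, seed, and all names are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a random linear target in 2D and separable data,
# with a leading column of ones as the bias coordinate.
N, d = 20, 2
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, size=(N, d))])
w_target = rng.uniform(-1, 1, size=d + 1)
y = np.sign(X @ w_target)          # labels from the target function

def pla(X, y):
    """Perceptron learning algorithm: update on a misclassified point
    until none remain (guaranteed to terminate on separable data)."""
    w = np.zeros(X.shape[1])
    updates = 0
    while True:
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:
            return w, updates
        i = rng.choice(mis)        # any misclassified point works
        w = w + y[i] * X[i]        # nudge w toward classifying point i correctly
        updates += 1

w, updates = pla(X, y)
print("updates:", updates, " final E_in:", np.mean(np.sign(X @ w) != y))
```

Plotting the data, the target line, and the final hypothesis (e.g. with matplotlib) then reproduces the kind of pictures the problem asks for, and repeating the run many times gives the histogram of update counts.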
  #27  
Old 05-13-2016, 07:34 PM
williamshakespare
Junior Member
 
Join Date: Mar 2016
Posts: 6
Default Re: Exercises and Problems

Quote:
Originally Posted by htlin View Post
Thank you so much for the valuable feedback!
Mr. Lin, I am studying the Python language in order to complete Problem 1.4.
Would you please share your answer (Python code) as an example for me?
Thanks.


The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.