#21
Question about the pocket algorithm described on p. 80 and illustrated on p. 83.
For data that is not linearly separable, I originally thought the pocket algorithm would run exactly like the PLA and simply take the best-performing w vector over the entire iteration range and report it back. However, the description on p. 80 says that the w vector is not updated if E_in is worse. Suppose that after a certain number of iterations the set of misclassified points is such that every misclassified point used to perform an update returns a worse-performing w vector. In this case, the algorithm will never be able to find a better w vector.

My experience watching the PLA as it iterates is that it seems to jump about and eventually (randomly) converges (with linearly separable data). I've seen proofs on the internet that it will converge, but I do not really understand them. My point is that each iteration of the PLA does not necessarily improve g(x), but that in order to reach the best g(x), this jumping about is part of the process. Can you please comment?
#22
There are two weight vectors involved in the pocket algorithm: the PLA vector w(t), which is updated with a misclassified point at every iteration exactly as in the ordinary PLA (so it keeps jumping about, just as you observed), and the pocket vector, which is only replaced when the new w(t+1) achieves a lower E_in than the best value seen so far. The PLA updates themselves are never blocked when E_in gets worse; it is only the pocket copy that stays unchanged. So the situation you describe does not trap the algorithm: w(t) keeps moving, and whenever it stumbles onto something better, the pocket is updated.
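As a rough illustration (this is only a minimal sketch of that two-vector bookkeeping, not code from the book; X is assumed to be an array of inputs with a leading 1 for the bias coordinate and y the corresponding +1/-1 labels):

[CODE]
import numpy as np

def error_rate(w, X, y):
    """Fraction of points misclassified by w, i.e. the in-sample error E_in."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, max_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])        # PLA vector w(t): updated at every iteration
    w_pocket = w.copy()             # pocket vector: replaced only when E_in improves
    best_err = error_rate(w, X, y)
    for _ in range(max_iters):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            break                   # nothing misclassified; data happened to be separable
        i = rng.choice(wrong)
        w = w + y[i] * X[i]         # ordinary PLA update -- done regardless of E_in
        err = error_rate(w, X, y)
        if err < best_err:          # only the pocket copy is conditional on improvement
            best_err, w_pocket = err, w.copy()
    return w_pocket                 # report the best vector seen, not the final w(t)
[/CODE]

The "jumping about" lives entirely in w, while w_pocket can only improve.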
__________________
Where everyone thinks alike, no one thinks very much
#23
On Exercise 1.10(c): I'm getting confused about how to plot P[|ν − μ| > ε] as a function of ε. I assumed the distribution is binomial, right?
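Not an official answer, but here is one way such a plot could be produced empirically (assuming a fair coin, so μ = 0.5, with N = 10 flips and 100,000 runs as in part (b); the variable names are assumptions of this sketch):

[CODE]
import numpy as np
import matplotlib.pyplot as plt

mu, N, runs = 0.5, 10, 100_000
rng = np.random.default_rng(0)

# nu for one coin in each run: fraction of heads in N flips
nu = rng.integers(0, 2, size=(runs, N)).mean(axis=1)

eps = np.linspace(0.0, 0.5, 51)
# empirical P[|nu - mu| > eps]: fraction of runs whose deviation exceeds eps
p_emp = [(np.abs(nu - mu) > e).mean() for e in eps]
bound = 2 * np.exp(-2 * eps**2 * N)   # Hoeffding bound 2 exp(-2 eps^2 N)

plt.plot(eps, p_emp, label="empirical P[|nu - mu| > eps]")
plt.plot(eps, bound, label="Hoeffding bound")
plt.xlabel("epsilon")
plt.ylabel("probability")
plt.legend()
plt.show()
[/CODE]

The empirical curve is a step function (ν only takes the values 0, 0.1, ..., 1), which is where the binomial distribution enters; the plot itself is just that step curve against ε, sitting below the Hoeffding bound.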
#24
I conducted the experiment mentioned in Exercise 1.10(b), and I got the following results.
Counts of runs with 0, 1, ..., 10 heads, out of 100,000 runs:

ν_1    = 90, 999, 4375, 11803, 20329, 24793, 20411, 11685, 4450, 988, 77
ν_rand = 103, 1022, 4389, 11691, 20444, 24489, 20653, 11669, 4502, 941, 97
ν_min  = 62376, 37622, 2, 0, 0, 0, 0, 0, 0, 0, 0

That is: 90 times out of 100,000 runs I get 0 heads in 10 tosses of the first coin, 999 times out of 100,000 I get 1 head, ..., and 77 times out of 100,000 I get 10 heads. Likewise, 103 times out of 100,000 I get 0 heads in 10 tosses of the coin chosen at random, ..., and 97 times out of 100,000 I get 10 heads. And 62,376 times out of 100,000 I get 0 heads in 10 tosses of the coin for which the number of heads was minimum across the 1,000 coins.

So, as expected, the distributions of ν_1 and ν_rand are similar. Can someone please explain how we should interpret the result for ν_min? Is the distribution of ν_min suggesting that one should be careful about overfitting? Because if we have many hypotheses, there will almost always be some hypothesis that fits the data set exactly. So what should one's strategy be for selecting a hypothesis?
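For anyone who wants to reproduce numbers like these, here is a rough sketch of the simulation (my own variable names; it uses 10,000 runs rather than 100,000 to keep it quick, so the counts come out roughly a tenth of those above):

[CODE]
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_flips, runs = 1000, 10, 10_000

counts_1 = np.zeros(n_flips + 1, dtype=int)     # histogram of heads for c_1
counts_rand = np.zeros(n_flips + 1, dtype=int)  # ... for c_rand
counts_min = np.zeros(n_flips + 1, dtype=int)   # ... for c_min

for _ in range(runs):
    heads = rng.binomial(n_flips, 0.5, size=n_coins)  # heads for each of the 1,000 coins
    counts_1[heads[0]] += 1                           # the first coin
    counts_rand[heads[rng.integers(n_coins)]] += 1    # a coin chosen at random
    counts_min[heads.min()] += 1                      # the coin with the fewest heads

print("nu_1    counts:", counts_1)
print("nu_rand counts:", counts_rand)
print("nu_min  counts:", counts_min)
[/CODE]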
#25
Your connection to learning is correct: when there are many hypotheses, you should be more careful about overfitting. But the main point of the exercise is to realize that if you pick a coin carefully based on the data of the flips (in this case "carefully" means having minimum ν), then the distribution of the ν you see will not be what you would expect if you tossed that *same* coin again. If you tossed that c_min again, you would expect to see a binomial distribution for the number of heads. But the distribution of ν_min is clearly not binomial.
Carrying this over to learning: if you pick a hypothesis "carefully", say the one with minimum E_in, then the E_in you get will not necessarily reflect the distribution of errors you get when you "toss that hypothesis again", i.e., test it on new data.
__________________
Have faith in probability
#26
Would you like to share your answer, please? Thanks.
#27
Mr. Lin, I am studying the Python language in order to complete Problem 1.4.

Would you please share the answer (Python code) to provide an example for me? Thanks.
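In the meantime, here is a rough sketch of the kind of experiment Problem 1.4 asks for (the random target line, data size, and variable names are assumptions of this sketch, not the book's solution): generate linearly separable 2D data from a random target line and run the plain PLA update until nothing is misclassified.

[CODE]
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical 2D target: points are labeled by which side of a random line they fall on
p, q = rng.uniform(-1, 1, size=(2, 2))
def target(x):
    # sign of the cross product of (q - p) and (x - p)
    return np.sign((q[0] - p[0]) * (x[:, 1] - p[1]) - (q[1] - p[1]) * (x[:, 0] - p[0]))

N = 20
X2 = rng.uniform(-1, 1, size=(N, 2))    # N random points in [-1, 1]^2
y = target(X2)                          # +1 / -1 labels from the target line
X = np.column_stack([np.ones(N), X2])   # prepend the constant coordinate x0 = 1

w = np.zeros(3)
updates = 0
while True:
    wrong = np.flatnonzero(np.sign(X @ w) != y)
    if wrong.size == 0:
        break                           # converged: data is linearly separable by construction
    i = rng.choice(wrong)
    w = w + y[i] * X[i]                 # PLA update on one misclassified point
    updates += 1

print("updates:", updates, "final weights:", w)
[/CODE]

Plotting the data points together with the target line and the line defined by w is then a matter of matplotlib.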
#28
For Question 1.3 we are asked to argue that the move from w(t) to w(t+1) is a move "in the right direction". I think I may be misunderstanding the question and/or Figure 1.3. My impression is that the figure shows us the case of R^2 (i.e., d = 1), but that for arbitrary R^d we are considering the case where we change w(t) to w(t+1) and then argue that the resulting change in the location of the boundary plane is a move toward the point (x, y), and therefore more likely than not to pass by it, thereby putting (x, y) on the opposite (and correct) side of the classification boundary.

This is easy enough to say in English, as I just did. But unless I'm missing something, I don't think the analytic proof is necessarily so straightforward. It seems to me that I'd have to show that the plane moves closer to the point (x, y). I think then I would have to argue that moving closer implies increasing the likelihood of passing by/through the point (this part seems straightforward). And then I would argue that increasing the likelihood of passing by/through the point is logically equivalent to increasing the likelihood of correctly classifying the point.

Is my overarching logic here sound? And is a mathematical (specifically, analytic) proof of this argument what you are intending for an answer? Or is the intent just for the reader to formulate an English explanation such as the one I attempt to give above?

Thank you very much for the help! I am really enjoying your book.
#29
Hint: the answer to (c) is meant to be formed from the answer to (b). Hope this helps.
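Concretely (this is only a sketch of the step the hint points to, built from the PLA update rule w(t+1) = w(t) + y(t) x(t) stated in the text, not the official solution): for the misclassified point, part (a) gives y(t) w^T(t) x(t) < 0, and since y(t)^2 = 1,

y(t) w^T(t+1) x(t) = y(t) (w(t) + y(t) x(t))^T x(t) = y(t) w^T(t) x(t) + ||x(t)||^2 > y(t) w^T(t) x(t).

A point is classified correctly exactly when y(t) w^T x(t) > 0, so each update strictly increases this quantity for the point used, moving it toward (though not necessarily past) the correct sign. That is the sense in which the move is "in the right direction", with no geometry about planes moving closer required.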
__________________
When one teaches, two learn.
#30
Ah, that makes sense. (I was vastly overthinking this.)
Thank you!
Tags: hoeffding's inequality, hoeffding-inequality