Quote:
Originally Posted by catherine
Hi Elroch, from your comment above I understand that my test set was too small. How large should it be? How did you go about estimating the 'disagreement' between the target function and the final PLA / SVM hypotheses? According to the HW instructions, this 'disagreement' can be either calculated exactly or approximated by generating a sufficiently large set of points to evaluate it. How would you go about calculating it exactly?

Calculating it exactly involves some fiddly geometry to determine the area between the two lines within the square. The fiddliness comes from the fact that each line can cross any of the four sides of the square (it would be easier if the domain were a circle, or if we knew the crossing point was near the centre of the square, in which case the angle between the lines would be enough). I had a look at calculating it in an earlier homework, but decided it wasn't worth the bother.
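For what it's worth, the case analysis gets less fiddly if you treat it as polygon clipping: the region where two linear classifiers disagree over the square is the union of two convex pieces, each obtained by clipping the square with one half-plane from each line. A minimal Python sketch, where the coefficient convention sign(a*x + b*y + c) and the [-1,1]^2 square are my own assumptions, not anything specified in the homework:

```python
def clip(poly, a, b, c):
    """Clip a convex polygon, keeping points with a*x + b*y + c >= 0
    (Sutherland-Hodgman against a single half-plane)."""
    out = []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        dp = a * p[0] + b * p[1] + c
        dq = a * q[0] + b * q[1] + c
        if dp >= 0:
            out.append(p)
        if (dp >= 0) != (dq >= 0):
            # edge crosses the boundary line; add the intersection point
            t = dp / (dp - dq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def area(poly):
    """Polygon area by the shoelace formula (0 for an empty polygon)."""
    s = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2

def disagreement(l1, l2):
    """Fraction of [-1,1]^2 where sign(a1*x+b1*y+c1) != sign(a2*x+b2*y+c2)."""
    square = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    # (line 1 positive, line 2 negative) plus the reverse
    r1 = clip(clip(square, a1, b1, c1), -a2, -b2, -c2)
    r2 = clip(clip(square, -a1, -b1, -c1), a2, b2, c2)
    return (area(r1) + area(r2)) / 4.0
```

As a sanity check, the lines y = 0 and x = 0 disagree on exactly two opposite quadrants, so disagreement((0, 1, 0), (1, 0, 0)) should be 0.5.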
More straightforward is to make the sample big enough. 1000 points is a long way short of what you need, because all but about 10 to 20 of those points are accurately classified by both algorithms.
The uncertainty in such estimates is quite apparent. Suppose you have a method, you want to estimate its accuracy, and over a number of runs you find that on average 10 of 1000 random points are misclassified. Each point is an independent sample from a distribution with about 1% of one value and 99% of the other, so the count of misclassified points is binomial with mean 10 and standard deviation sqrt(1000 * 0.01 * 0.99), roughly 3. In a single run there is therefore huge uncertainty in the estimate: getting 5 or 15 misclassified points is entirely plausible. Because this noise affects the misclassified counts of each of the two methods, the uncertainty in the difference between them is larger still.
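You can see this scatter directly by simulating the evaluation runs. A small sketch (the 1% error rate and run counts are just the illustrative numbers from above):

```python
import random

def misclassified_count(n_points, error_rate, rng):
    # One simulated evaluation run: each test point is independently
    # misclassified with probability error_rate.
    return sum(rng.random() < error_rate for _ in range(n_points))

rng = random.Random(0)
runs = [misclassified_count(1000, 0.01, rng) for _ in range(20)]
# The counts scatter widely around the mean of 10: with a standard
# deviation of about 3, values from 5 to 15 turn up routinely.
```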
The consequence is that the advantage of the better method is largely obscured when the sample is small, because this noise in the estimates dominates a rather delicate signal.
Hence I used 100,000 random points, so that the number of misclassified points for each method was much more stable; empirically, this gave quite repeatable results. The uncertainty in the misclassification error of each of the two algorithms can be estimated separately by doing a moderate number of repeat runs (e.g. with 10,000 points each) and looking at the range of values found. You can then combine the runs together and infer a good estimate of the uncertainty of the combined run, using the fact that the variance of the estimate is inversely proportional to the number of samples.
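The scaling is easy to quantify: the standard error of an estimated error rate p from n points is sqrt(p(1-p)/n), so multiplying the sample size by 100 shrinks the uncertainty by a factor of 10. A small sketch with the numbers from this thread:

```python
import math

def std_error(p, n):
    # Standard error of an estimated misclassification rate p
    # measured on n independent test points.
    return math.sqrt(p * (1 - p) / n)

# With a true error rate around p = 0.01:
#   n = 1,000   gives a standard error near 0.0031 (over 30% of p itself)
#   n = 100,000 gives a standard error near 0.00031 (about 3% of p)
```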
[Could you give a link to the documentation you mentioned? I can't find a reference to "sweep" in the documentation I used at
http://cran.rproject.org/web/packag...ab/kernlab.pdf, and I don't quite see what this function does from the R documentation.]