Quote:
Originally Posted by Randy
Actually, w0 never converges to anything meaningful using the w = w + y*x update rule because x0 = 1 for all the data points and y is constrained to be either -1 or +1... so w0 just flips between 1 and zero forever without converging.

Its range is a bit more expansive than just 1 and 0. For instance, I ran PLA on 100 different instances with 23 training points each, kept track of all the values w0 took on and how many times it took on each value, and this was the result:
Code:
-4.0 6
-3.0 76
-2.0 168
-1.0 329
0.0 603
1.0 538
2.0 139
3.0 43
4.0 26
5.0 15
6.0 1
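Here is a minimal sketch of the kind of experiment above, under my own assumptions (a random linear target on [-1,1]^2, 23 points per run, 100 runs, picking a random misclassified point each iteration); the exact setup and counts in the original run may differ:

```python
import random
from collections import Counter

def run_pla(n_points, rng):
    """Run PLA to convergence on one random, linearly separable data set.
    Returns every value w0 takes on during the run (including the initial 0)."""
    # Random target function: sign(a*x1 + b*x2 + c)
    a, b, c = (rng.uniform(-1, 1) for _ in range(3))
    pts = [(1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_points)]
    ys = [1 if a * x1 + b * x2 + c > 0 else -1 for (_, x1, x2) in pts]

    w = [0.0, 0.0, 0.0]
    w0_values = [w[0]]
    while True:
        mis = [i for i, (x, y) in enumerate(zip(pts, ys))
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y]
        if not mis:          # all points classified correctly: PLA has converged
            break
        i = rng.choice(mis)  # standard PLA update: w <- w + y*x
        w = [wi + ys[i] * xi for wi, xi in zip(w, pts[i])]
        w0_values.append(w[0])  # x0 = 1, so w0 moves by exactly +/- 1 each step
    return w0_values

rng = random.Random(0)
counts = Counter()
for _ in range(100):
    counts.update(run_pla(23, rng))
for v in sorted(counts):
    print(v, counts[v])
```

Since x0 = 1, every update changes w0 by exactly y, which is why the histogram is supported on the integers.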
This happens because, even though a large-magnitude w0 biases the output toward the sign of w0, that bias can still be overcome by w1 and w2: you can get, say, a -1 point misclassified as +1 even when w0 is very negative, which pushes w0 further negative still. The same goes for large positive w0.
An interesting question would be what the distribution of the values w0 can reach looks like. I don't think it's a simple random walk, because the farther w0 gets from 0, the less likely it seems to step farther away. Maybe a Gaussian? That doesn't seem exact either: looking at a few of these histograms, they seem skewed rather than symmetric, but I didn't look at many samples, so this could just be normal random variation.