Quote:
Originally Posted by ehaussmann
Am I missing something, or is the name f(x) a bit misleading? In the slides for logistic regression, f(x) is explicitly referred to as a probability. However, in the exercise it is a linear function that may also take negative values.
It seems that f(x) corresponds to the line learned inside the sigmoid function, but that's not the final g(x) we are learning (which should approximate f(x))?
The presence of f in the problem description may be a little confusing, since it hardly plays a role. f(x) is indeed a probability: P(y | x) is written in terms of it. Specifically, P(y=+1 | x) = f(x), and P(y=-1 | x) = 1 - f(x). But the problem description specifies a very simple f(x): it is always 1 on one side of the line, and always 0 on the other side. In other words, if we are on one side of the line, the "probability" that y will be +1 is exactly 1.0, and if we are on the other side, the "probability" that y will be -1 is also exactly 1.0.

Suppose it were a little different, say f(x) = 0.95 on one side of the line and f(x) = 0.05 on the other. That would mean that, as you generate sample points, there is a little noise in the labels, so the data are (probably) not linearly separable. Logistic regression is interesting because it attempts to predict probabilistic targets like this (hence the sigmoid function, which maps the signal w·x to a probability).
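If it helps to see the setup concretely, here is a minimal sketch in Python (my own illustration, not the exercise's code; make_target, sample, and the flip parameter are hypothetical names) of a target that is 1 on one side of a random line and 0 on the other, with an optional noise level for the 0.95 / 0.05 variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_target():
    """Pick a random line through two points in [-1, 1]^2; f(x) is 1 on one side, 0 on the other."""
    p, q = rng.uniform(-1, 1, (2, 2))
    normal = np.array([q[1] - p[1], p[0] - q[0]])   # normal vector of the line through p and q
    offset = -normal @ p
    return lambda x: 1.0 if normal @ x + offset > 0 else 0.0

def sample(f, n, flip=0.0):
    """Draw n points uniformly from [-1, 1]^2 and label them according to P(y=+1 | x).

    flip=0.0 reproduces the deterministic target in the problem statement;
    flip=0.05 gives the 0.95 / 0.05 variant described above.
    """
    X = rng.uniform(-1, 1, (n, 2))
    p_plus = np.array([f(x) * (1 - flip) + (1 - f(x)) * flip for x in X])
    y = np.where(rng.uniform(size=n) < p_plus, 1, -1)
    return X, y
```

With flip=0.0 every label agrees with the line, so the data are separable; with any positive flip a few labels will (probably) land on the wrong side.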
In this problem, though, stochastic gradient descent is the focus. The fact that probabilities are involved matters mainly through the error function: the cross-entropy error is never exactly zero for any finite weight vector, so even though the data are linearly separable (as they are here), logistic regression ends up with a nonzero E_in. In that sense it behaves differently from PLA, even though its goal here is essentially the same.
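Here is a rough sketch of the SGD loop itself, assuming the usual per-point cross-entropy error ln(1 + exp(-y w·x)) and one random-order pass through the data per epoch; the learning rate, epoch cap, and stopping tolerance are placeholders rather than the exercise's specified values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_logistic(X, y, eta=0.01, epochs=1000, tol=0.01):
    """SGD on the cross-entropy error, one random-order pass through the data per epoch."""
    N = len(X)
    Xb = np.hstack([np.ones((N, 1)), X])          # prepend the constant coordinate x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        w_old = w.copy()
        for i in rng.permutation(N):
            # negative gradient of ln(1 + exp(-y * w.x)) at the single point (x_i, y_i)
            w += eta * y[i] * Xb[i] / (1 + np.exp(y[i] * (w @ Xb[i])))
        if np.linalg.norm(w - w_old) < tol:       # stop when an epoch barely moves w
            break
    return w

def cross_entropy(w, X, y):
    """E_in = average of ln(1 + exp(-y * w.x)); strictly positive for any finite w."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.mean(np.log(1 + np.exp(-y * (Xb @ w))))
```

Run on the separable data from the earlier sketch, this typically stops with a small but strictly positive E_in, whereas PLA would simply halt at a separating weight vector.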