In the course, one interesting thing we looked at was using a tool designed for linear regression to attack classification problems. It proved of some use but less than ideal, for reasons which included the inappropriate cost function (and the inherent awkwardness of approximating a discontinuous step-like function with a linear function). In hindsight this was highly relevant to my own experiments with doing exactly this in the past, if only to see why it was not that great an idea!

But there is a very precise mapping between any regression problem and a classification problem. This mapping is made by including an extra dimension for the domain of the function, and considering the function we are trying to approximate as the boundary of the classification problem.

With this view, every data point we have can be considered to provide an infinite number of data points for classification. If

, we know

and

for any

Looking in the other direction, there is a very interesting observation (at least, to me). Suppose we know the value of

must lie in some interval

, then we require one bit of information about f(x) for each halving of this range. Explicitly, we pick the midpoint of the range, classify the point as above or below

and that tells us what we want. We think of functions as infinitely precise but, in the real world, information is finite, and of finite accuracy, and this gives us an explicit relationship between the

bits associated with classification of

points and a finite precision

of a function.

Anyhow, what I am interested in here is the use of any classification methodology to attack regression problems by this mapping. I am not sure how this relates to, say, the use of SVMs for regression. I imagine it might be efficient to turn each data point for a regression problem into two points for classification, by adding and subtracting a small quantity and considering them to be classified above and below the function surface, but it would likely be useful to have other points to get a better relationship between the error functions. With the above idea in mind, one possibility is to space the points for classification in geometric progression (getting more widely spaced away from the function/boundary), so that classification errors are directly related to the number of bits of accuracy of the resulting regression function! Other spacings of the points can be chosen to correspond to any pointwise error function.

Can anyone clarify how these ideas relate to the current state of knowledge and technology?