#1




Kernel methods for SVM and quantum computing
I'm posting this even though I don't have an intelligent question to ask. Only this: I recently took a MOOC on quantum computing (Vazirani at Berkeley, on Coursera) and then this course, and I'm a little struck by the similarity between the two subjects. If I learned anything in that course, it was that in quantum computing you have, in effect, a vast (exponentially large) number of parallel processors available for your calculations, but unfortunately no way to read out all their results. What you can do is some kind of compression of all those calculations into a single number (or a small set of numbers), like the Fourier transform of all those wave functions sampled at a particular frequency, or something similar. Then, if you're very lucky, you find that the sampled value answers some important question. They've managed to find compressions that work to factor large numbers, search N boxes in about √N steps, and do a number of other interesting calculations that would take far more computing power any other way.
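To make the "search N boxes" example concrete, here is a minimal classical simulation of Grover's search, offered as a sketch rather than anything from the course. It stores all N amplitudes explicitly, so it costs O(N) memory and shows only the arithmetic, not a quantum speed-up. The function name `grover_search` and the choice of N = 64 with box 42 marked are assumptions made up for illustration.

```python
import math

def grover_search(n_items, marked):
    """Simulate Grover's search on a classical list of real amplitudes."""
    # Start in the uniform superposition: every box has amplitude 1/sqrt(N).
    amps = [1.0 / math.sqrt(n_items)] * n_items
    # Roughly (pi/4) * sqrt(N) rounds bring the marked amplitude close to 1.
    rounds = int(math.floor(math.pi / 4.0 * math.sqrt(n_items)))
    for _ in range(rounds):
        # Oracle step: flip the sign of the marked box's amplitude.
        amps[marked] = -amps[marked]
        # Diffusion step: reflect every amplitude about the mean amplitude.
        mean = sum(amps) / n_items
        amps = [2.0 * mean - a for a in amps]
    # Measurement probabilities are the squared amplitudes.
    probs = [a * a for a in amps]
    return rounds, probs

rounds, probs = grover_search(64, marked=42)
best = max(range(64), key=lambda i: probs[i])
print("rounds:", rounds, "most likely box:", best, "probability:", probs[best])
```

The point of the sketch is that the answer is never read out of the individual "processors": each round only pushes probability toward the marked box, and a single measurement at the end then returns it with high probability.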
_Anyhow_, I was struck by the professor's explanation of kernel methods, which sounded almost exactly the same: there is an infinite-dimensional vector space out there and we're searching it, but we don't need to go there. We just use a simple calculation of the kernel (a dot product in that space), which gives us the essential information we need from it... Here I ought to ask a question, but I don't know what it should be. Maybe: can SVMs be a method of gathering information back from the QC multiverse?
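The "we don't need to go there" point can be checked numerically. The sketch below is an illustration under my own choice of kernels, not code from the course. It compares a degree-2 polynomial kernel with the dot product of its explicit 6-dimensional feature map, and then compares a Gaussian (RBF) kernel on 1-D inputs with a truncated version of its infinite feature expansion. All function names and test inputs are hypothetical.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: (1 + x.z)^2."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def poly_features(x):
    """Explicit 6-D feature map whose dot product equals poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def rbf_kernel(x, z):
    """Gaussian (RBF) kernel on 1-D inputs: exp(-(x - z)^2)."""
    return math.exp(-(x - z) ** 2)

def rbf_features(x, n_terms):
    """First n_terms coordinates of the infinite feature map behind rbf_kernel.

    Coordinate k is exp(-x^2) * sqrt(2^k / k!) * x^k, which follows from
    expanding exp(2xz) as a power series.
    """
    return [math.exp(-x * x) * math.sqrt(2.0 ** k / math.factorial(k)) * x ** k
            for k in range(n_terms)]

x, z = (1.0, 2.0), (3.0, -1.0)
poly_direct = poly_kernel(x, z)
poly_explicit = dot(poly_features(x), poly_features(z))

rbf_direct = rbf_kernel(0.5, -0.3)
rbf_truncated = dot(rbf_features(0.5, 20), rbf_features(-0.3, 20))

print(poly_direct, poly_explicit)
print(rbf_direct, rbf_truncated)
```

In the polynomial case the two numbers agree up to rounding, and in the RBF case the truncated sum approaches the kernel value as more terms are kept. The kernel evaluation itself never builds the feature vectors, which is the sense in which the SVM works in that space without visiting it.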
#2




Re: Kernel methods for SVM and quantum computing
My long-term background is in functional analysis research, and I find the Hilbert space formulation of quantum mechanics a satisfying one, along with partial glimpses of the harder theories that lead from relativistic quantum mechanics to things like symmetry breaking. Quantum computing fits a lot more naturally into this more abstract formulation than into ones which can lead to conflicts with intuition.
I too have been mulling over the analogy between probabilistic inference in machine learning and the uncertainty of quantum mechanics, and I think an interesting illustration is to be found in an example used in the lectures. Suppose you are presented with 2 data points as samples of a real-valued function of one real variable. Having done this course we know that fitting a straight line through the points would not be a great idea, as it is likely to overfit. Given no other information, we have two plausible choices as to what to do, and unless we want to be convicted of data snooping, we had better decide which one to use before looking at the data points. Machine A uses the hypothesis set of constant functions and fits the two points with their average. Machine B uses the hypothesis set of lines through the origin and fits the two points using least squares regression on a single parameter, the slope. Having used either of these machines we have a model, and we can argue that it is the best model of those in our hypothesis set, but it's not possible to combine these two pieces of knowledge to arrive at something better.
This is a (perhaps poor) analogy to the concept of incompatible observations in quantum mechanics, where we can make observations of different types and make inferences from them, but not simultaneously. It's as if our window on the object is very small (two points of a function ↔ one spin axis for angular momentum), and we have a choice of what to look at through it (the mean or the slope ↔ spin about just one axis). It is a weakness of the analogy that an observation in quantum mechanics destroys information (strictly speaking, the information moves from the subsystem being observed into entanglement between the measurement device and the subsystem, in a way closely related to the "spooky" relationships central to quantum computers), whereas applying a machine learning algorithm to choose a hypothesis doesn't seem to destroy anything.
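For concreteness, the two machines can be written out in a few lines. This is a minimal sketch with made-up data: the two points and the function names `fit_constant` and `fit_through_origin` are hypothetical, and the slope formula is the standard least-squares solution for a line constrained to pass through the origin.

```python
def fit_constant(points):
    """Machine A: least-squares constant h(x) = b, which is the mean of the y values."""
    return sum(y for _, y in points) / len(points)

def fit_through_origin(points):
    """Machine B: least-squares line h(x) = a * x, with a = sum(x*y) / sum(x*x)."""
    sxx = sum(x * x for x, _ in points)
    if sxx == 0.0:
        raise ValueError("slope is undefined when every x is zero")
    return sum(x * y for x, y in points) / sxx

def mean_squared_error(points, predict):
    """In-sample mean squared error of a prediction function."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

points = [(1.0, 2.0), (3.0, 3.0)]  # two hypothetical samples of the unknown function

b = fit_constant(points)
a = fit_through_origin(points)

print("Machine A: h(x) =", b, " in-sample MSE:", mean_squared_error(points, lambda x: b))
print("Machine B: h(x) =", a, "* x  in-sample MSE:", mean_squared_error(points, lambda x: a * x))
```

Each machine answers a different question about the same two points (their level, or their slope through the origin), and the two fitted models generally disagree away from the data.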
Perhaps a closer analogy to the "information destruction" of an observation is the pollution of objectivity by data snooping. The observer may be obliged to be part of the experiment. [EDIT: the crudeness of my analogy is made clear by the fact that the full linear hypothesis set may be the appropriate one even if we are only given two data points. This is most obvious if the target is some unknown linear function and our observations are noiseless, but it is also true in other cases that approximate this. But I still think that the idea of independent, mutually incompatible inferences is one that could be made precise with a little careful construction. An interesting challenge would be to construct an example where two incompatible hypothesis sets achieve identical out-of-sample performance; in the lectures, OOS errors varied between different hypothesis sets, so there was a single best one.]
#3




Re: Kernel methods for SVM and quantum computing
Further studies have led me to what looks like a nice analogy between quantum computing and machine learning. This arises from a field of machine learning which we didn't look at much in the course: Bayesian learning.
With the assumption of a prior distribution, this provides us, in principle, with the following. We have a parametrised hypothesis set, each hypothesis providing outputs or a probability distribution of outputs, with some prior distribution over these hypotheses (the part Yaser described as "robbing a bank" in his lecture). We are now given a set of inputs. For any individual hypothesis we can work out the probability distribution of possible outputs from a particular input (in some cases each probability will be either 1 or 0, in others it will be a general probability). Then we apply Bayes' rule to the prior and get a probability distribution over the hypotheses that could give that output. Turning the handle gives us a probability distribution for the outputs. The relevance to this discussion is that we are effectively applying every possible hypothesis in parallel. The posterior distribution describes the "state" of the hypothesis, but we never "collapse" it into a single state.
Using this approach you can do things like use degree-10 polynomials as a hypothesis set, get 3 data points, and return a probability distribution for the value of the function at any other point. Or you could do the same while incorporating some knowledge of the uncertainty in the three data points.
I'll make the observation that although there is the huge philosophical problem of having to come up with a prior distribution, this seems to parallel the practical usefulness of regularisation in the machine learning we used. We did it, but what were we doing? In what way does the magnitude of a coefficient have anything to do with the error function? As well as that, we implicitly used maximum likelihood estimation, which is a philosophical leap as well. To me the contrast:
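As an aside on the Bayesian procedure described earlier in this post, here is a minimal sketch of "applying every hypothesis in parallel". Everything specific in it is an assumption chosen to keep the example small: the hypothesis set is lines through the origin with slope on a finite grid (not degree-10 polynomials), the prior on the slope is a standard Gaussian, the observation noise is Gaussian with standard deviation 0.5, and the three data points are invented.

```python
import math

def gaussian_pdf(v, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_over_slopes(data, slopes, prior_sd=1.0, noise_sd=0.5):
    """Bayes' rule on a grid: posterior weight is prior times likelihood, normalised."""
    weights = []
    for s in slopes:
        w = gaussian_pdf(s, 0.0, prior_sd)        # assumed Gaussian prior on the slope
        for x, y in data:
            w *= gaussian_pdf(y, s * x, noise_sd)  # assumed Gaussian noise model
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def predict(x_new, slopes, post):
    """Posterior mean and standard deviation of the function value at x_new."""
    mean = sum(p * s * x_new for p, s in zip(post, slopes))
    var = sum(p * (s * x_new - mean) ** 2 for p, s in zip(post, slopes))
    return mean, math.sqrt(var)

data = [(0.5, 0.6), (1.0, 0.9), (2.0, 2.1)]     # three hypothetical data points
slopes = [-3.0 + 0.01 * i for i in range(601)]  # grid of candidate hypotheses
post = posterior_over_slopes(data, slopes)
pred_mean, pred_sd = predict(3.0, slopes, post)
print("prediction at x = 3.0:", pred_mean, "+/-", pred_sd)
```

No single slope is ever selected: every hypothesis on the grid contributes to the prediction in proportion to its posterior weight, which is the sense in which the "state" is never collapsed.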


