I spotted a less sophisticated way of thinking about it which seems helpful to me.

If you merely assume that general points associated with a particular target (say +1) are more likely to be near sample points with that target than far from them, then the bigger the margin, the lower the probability that a general point with target +1 will be wrongly classified (because the bigger the minimum distance from any +1 point in the sample to a point that would be classified differently).

This ties in quite intuitively with the idea that distances from the support vectors (or some transformed distance, if kernels are used) form the basis of the hypothesis.
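To make the intuition concrete, here is a minimal numpy sketch (the toy data and the two hyperplanes are invented purely for illustration, not taken from any particular SVM). Both hyperplanes separate the sample perfectly, but the one with the larger margin also correctly classifies a nearby "general" point, while the smaller-margin one does not:

```python
import numpy as np

# Toy 2-D sample: two linearly separable classes.
X = np.array([[2.0, 2.0], [3.0, 3.0],      # target +1
              [-2.0, -2.0], [-3.0, -3.0]])  # target -1
y = np.array([1, 1, -1, -1])

def margin(w, b, X, y):
    """Minimum distance from any sample point to the hyperplane
    w.x + b = 0 (positive iff every point is on its correct side)."""
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Two separating hyperplanes (chosen by hand for this example):
w_wide, b_wide = np.array([1.0, 1.0]), 0.0      # balanced, large margin
w_narrow, b_narrow = np.array([1.0, 1.0]), 2.5  # shifted toward the -1 points

print(margin(w_wide, b_wide, X, y))      # ~2.83, the larger margin
print(margin(w_narrow, b_narrow, X, y))  # ~1.06, the smaller margin

# A "general" point whose true target is -1: it lies near the -1
# sample points, but inside the narrow hyperplane's small margin.
p = np.array([-1.0, -1.0])
print(np.sign(p @ w_wide + b_wide))      # -1: correctly classified
print(np.sign(p @ w_narrow + b_narrow))  # +1: wrongly classified
```

Under the assumption in the text (general points cluster near same-target sample points), the wide-margin boundary leaves more room around the sample before any such point falls on the wrong side.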