LFD Book Forum  

Course Discussions > Online LFD course > Homework 8
#1
03-03-2013, 09:21 AM
Anne Paulson
Senior Member

Join Date: Jan 2013
Location: Silicon Valley
Posts: 52
How many support vectors is too many?

In the lectures, it's mentioned that often with an SVM you get the happy news that you have just a few support vectors, and you therefore know that your VC dimension is small and you can expect good generalization.

But how many vectors do you need before you're unhappy? Let's suppose you have a dataset with 7000 observations and 550 variables, an easy supposition for me because I do. Suppose you run an SVM and you discover that with a linear kernel, you have some 700 support vectors, and with a radial kernel, you have some 2000.

That seems like a lot; nearly half the points are support vectors. But if you attack the problem with another machine learning method like neural nets or multinomial regression, you will also have one or two thousand parameters, or maybe more, so you will also have a big VC dimension. So maybe you should be happy with the 2000 support vectors if it looks like you're getting good generalization in cross-validation? Or should you be happy regardless of the number of support vectors, as long as cross-validation shows good generalization? Or does VC dimension not matter at all if the cross-validation news is good?
#2
03-03-2013, 12:20 PM
yaser
Caltech

Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,477
Re: How many support vectors is too many?

Quote:
Originally Posted by Anne Paulson View Post
In the lectures, it's mentioned that often with an SVM you get the happy news that you have just a few support vectors, and you therefore know that your VC dimension is small and you can expect good generalization.

But how many vectors do you need before you're unhappy? Let's suppose you have a dataset with 7000 observations and 550 variables, an easy supposition for me because I do. Suppose you run an SVM and you discover that with a linear kernel, you have some 700 support vectors, and with a radial kernel, you have some 2000.

That seems like a lot; nearly half the points are support vectors. But if you attack the problem with another machine learning method like neural nets or multinomial regression, you will also have one or two thousand parameters, or maybe more, so you will also have a big VC dimension.
For an absolute estimate, the theoretical result mentioned at the end of Lecture 14 (last slide) shows the generalization error to be approximately bounded by the ratio of the number of support vectors to the number of examples. For a comparative estimate, the same result implies that the number of support vectors behaves more or less like the VC dimension of other models.
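In symbols, with N training examples, the bound reads roughly

\mathbb{E}[E_{\rm out}] \le \frac{\mathbb{E}[\#\,{\rm SVs}]}{N-1}.

With the numbers from the original post, the RBF run gives about 2000/(7000-1) \approx 0.29.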

Of course there are situations where neither SVM nor other models will perform to a satisfactory level, as we would expect if the resources of data are not adequate to capture the complexity of the target.

Quote:
So maybe you should be happy with the 2000 support vectors if it looks like you're getting good generalization in cross-validation? Or should you be happy regardless of the number of support vectors, as long as cross-validation shows good generalization? Or does VC dimension not matter at all if the cross-validation news is good?
Indeed, the cross-validation estimate can overrule (our interpretation of) the theoretical VC bound. The VC result is just a bound that applies to all cases, while cross-validation provides a concrete estimate for a particular case.
__________________
Where everyone thinks alike, no one thinks very much
#3
03-04-2013, 08:42 AM
alternate
Member

Join Date: Jan 2013
Posts: 14
Re: How many support vectors is too many?

Mentioning the VC dimension brings up something I considered briefly.

It's been said that when we make decisions based on seeing the data, we should account for all of the options we considered when we think about generalization. This holds in the extreme case of data snooping, and in the lesser case where we should account for the fact that cross-validation adds a little bit of contamination.

But what about, say, a "failed" SVM? For example, we try the SVM hypothesis and get back 500 support vectors out of 1000, then decide to change the model because the first one won't generalize.

Realistically, if I then go to a different kernel or a neural network or something else, that model doesn't care whether it was run before or after another one; it will produce the same result. But I could also see the interpretation where the SVM counts as a hypothesis space I explored, much as if I had tweaked parameters based on the data. To what degree is that the case? Presumably there would be a tradeoff between accepting the weak model and accepting weaker generalization, if any; I guess that could probably be automated, too.
#4
03-04-2013, 09:23 AM
hemphill
Member

Join Date: Jan 2013
Posts: 18
Re: How many support vectors is too many?

Quote:
Originally Posted by yaser View Post
For an absolute estimate, the theoretical result mentioned at the end of Lecture 14 (last slide) shows the generalization error to be approximately bounded by the ratio of the number of support vectors to the number of examples. For a comparative estimate, the same result implies that the number of support vectors behaves more or less like the VC dimension of other models.

Of course there are situations where neither SVM nor other models will perform to a satisfactory level, as we would expect if the resources of data are not adequate to capture the complexity of the target.


Indeed, the cross-validation estimate can overrule (our interpretation of) the theoretical VC bound. The VC result is just a bound that applies to all cases, while cross-validation provides a concrete estimate for a particular case.
I have noticed that the support vector ratio sometimes provides a very good bound, at least with our toy problems. Suppose we had a situation like:

C=C_1, E_{\rm cv}=0.050, support vectors=20
C=C_2, E_{\rm cv}=0.049, support vectors=60

where the size of the data set is around 400. I would be tempted to choose C_1, regardless of E_{\rm cv}. What are your thoughts?
#5
03-04-2013, 12:24 PM
yaser
Caltech

Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,477
Re: How many support vectors is too many?

Quote:
Originally Posted by alternate View Post
Mentioning the VC dimension brings up something I considered briefly.

It's been said that when we make decisions based on seeing the data, we should account for all of the options we considered when we think about generalization. This holds in the extreme case of data snooping, and in the lesser case where we should account for the fact that cross-validation adds a little bit of contamination.

But what about, say, a "failed" SVM? For example, we try the SVM hypothesis and get back 500 support vectors out of 1000, then decide to change the model because the first one won't generalize.

Realistically, if I then go to a different kernel or a neural network or something else, that model doesn't care whether it was run before or after another one; it will produce the same result. But I could also see the interpretation where the SVM counts as a hypothesis space I explored, much as if I had tweaked parameters based on the data. To what degree is that the case? Presumably there would be a tradeoff between accepting the weak model and accepting weaker generalization, if any; I guess that could probably be automated, too.
There is a theoretical approach in Structural Risk Minimization that accounts for the snooping effect when we use a hierarchy of models.
__________________
Where everyone thinks alike, no one thinks very much
#6
03-04-2013, 12:30 PM
yaser
Caltech

Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,477
Re: How many support vectors is too many?

Quote:
Originally Posted by hemphill View Post
I have noticed that the support vector ratio sometimes provides a very good bound, at least with our toy problems. Suppose we had a situation like:

C=C_1, E_{\rm cv}=0.050, support vectors=20
C=C_2, E_{\rm cv}=0.049, support vectors=60

where the size of the data set is around 400. I would be tempted to choose C_1, regardless of E_{\rm cv}. What are your thoughts?
I would also do that. Since both estimates of E_{\rm out} are statistical in nature, the bigger factor of safety offered by the smaller number of support vectors sounds safer.
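To make the comparison concrete with the \#\,{\rm SVs}/(N-1) bound from earlier in the thread: 20/399 \approx 0.050 while 60/399 \approx 0.150, so the second model's guarantee is about three times looser even though its E_{\rm cv} is marginally better.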
__________________
Where everyone thinks alike, no one thinks very much
#7
07-31-2013, 12:11 AM
hsolo
Member

Join Date: Jul 2013
Posts: 12
Re: How many support vectors is too many?

Quote:
Originally Posted by yaser View Post
For an absolute estimate, the theoretical result mentioned at the end of Lecture 14 (last slide) shows the generalization error to be approximately bounded by the ratio of the number of support vectors to the number of examples. For a comparative estimate, the same result implies that the number of support vectors behaves more or less like the VC dimension of other models.

Of course there are situations where neither SVM nor other models will perform to a satisfactory level, as we would expect if the resources of data are not adequate to capture the complexity of the target.


Indeed, the cross-validation estimate can overrule (our interpretation of) the theoretical VC bound. The VC result is just a bound that applies to all cases, while cross-validation provides a concrete estimate for a particular case.
Prof. Yaser,

Thanks for your clarifications -- I have some questions in the context of this thread, Problems 2, 3, etc. of the homework, and the generalization bounds.

1. Just a reiteration for clarity -- for the bound at the end of Lecture 14 as applied to soft-margin SVMs, by the number of SVs we mean the number of margin SVs, right? That would be consistent with the thought process that the non-margin SVs end up at the constraints (0 or C) and so aren't getting full 'freedom of expression', and therefore aren't 'independent params'.

2. I have been trying Problems 2, 3, etc. using cvxopt. I used a simplistic rule: if an alpha is very close (say within a range of a0 = 10^{-5}) to 0 or C, I round it to 0 or C respectively, and the remaining alphas become margin SVs. If the b's corresponding to these are consistent (meaning within a range of b0 = 10^{-3} of each other), then I conclude that there is nothing fundamentally unsound in what I am doing. I came to a0 and b0 with some trial and error. Is this a sound approach?

3. I would imagine the ranges a0 and b0 can be derived in some principled way -- for instance, I didn't really account for the relation between a0 and b0 or its relationship to cvxopt's numeric error tolerance. Such a principled choice of support vectors would be part of what packages like LIBSVM provide -- is that correct?

4. I found that in Problems 2 and 3, some of the classifiers have a single-digit number of margin SVs while some run into the thousands. I am somewhat uncomfortable with this huge variation, but a visual perusal of the thousands of distinct evaluations of b indicated they are all close to each other; moreover, the same code generates numbers that are essentially consistent with the margin support vectors and E_{\rm in} discussed in the classification problem in the thread http://book.caltech.edu/bookforum/showthread.php?t=4044
So I am hoping I am right in assuming that the distinct values of b being close to each other are a good indicator of soundness. Any comments?

5. Finally, in the context of the discussion in Lecture 17 on Data Snooping -- we would have to use 45 SVMs (one versus one) or 10 SVMs (one versus rest) for our 10-way classification problem. The number of margin SVs ranges from single digits to thousands. In this context, how should we view the theoretical generalization bounds? They would seem to fail the rule of thumb of a ratio of 10. But if E_{\rm out} on actual test data is low, can we still go ahead with this?
#8
07-31-2013, 02:23 AM
htlin
NTU

Join Date: Aug 2009
Location: Taipei, Taiwan
Posts: 601
Re: How many support vectors is too many?

Quote:
Originally Posted by hsolo View Post
1. Just a reiteration for clarity -- for the bound at the end of Lecture 14 as applied to soft-margin SVMs, by the number of SVs we mean the number of margin SVs, right? That would be consistent with the thought process that the non-margin SVs end up at the constraints (0 or C) and so aren't getting full 'freedom of expression', and therefore aren't 'independent params'.
For the CV bound, the number of SVs includes both the "free SVs" (those with 0 < \alpha < C, which lie on the margin) and the "bounded SVs" (those with \alpha = C, which violate the margin).
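As a minimal sketch of that counting rule (hypothetical names; alpha is the vector of dual variables, with a small tolerance standing in for exact equality):

Code:
import numpy as np

def count_svs(alpha, C, tol=1e-5):
    """Count the SVs that enter the CV bound: free plus bounded."""
    free = (alpha > tol) & (alpha < C - tol)   # 0 < alpha < C: on the margin
    bounded = alpha >= C - tol                 # alpha = C: violating the margin
    return int(free.sum() + bounded.sum())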

Quote:
Originally Posted by hsolo View Post
2. I have been trying Problems 2, 3, etc. using cvxopt. I used a simplistic rule: if an alpha is very close (say within a range of a0 = 10^{-5}) to 0 or C, I round it to 0 or C respectively, and the remaining alphas become margin SVs. If the b's corresponding to these are consistent (meaning within a range of b0 = 10^{-3} of each other), then I conclude that there is nothing fundamentally unsound in what I am doing. I came to a0 and b0 with some trial and error. Is this a sound approach?
It sounds reasonable.
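For concreteness, here is a minimal sketch of that rule around a cvxopt dual solve, assuming a precomputed Gram matrix K and labels y in \{-1,+1\}; the function name and the a0 default are illustrative, not part of any package:

Code:
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(K, y, C, a0=1e-5):
    """Solve the soft-margin SVM dual and round near-bound alphas to 0 or C."""
    N = len(y)
    P = matrix(np.outer(y, y) * K)                  # Q_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(N))                         # minimize (1/2) a'Pa - 1'a
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, N).astype(float))       # equality: y'alpha = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    alpha[alpha < a0] = 0.0                         # round down to the lower bound
    alpha[alpha > C - a0] = C                       # round up to the upper bound
    return alpha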

Quote:
Originally Posted by hsolo View Post
3. I would imagine the ranges a0 and b0 can be derived in some principled way -- for instance, I didn't really account for the relation between a0 and b0 or its relationship to cvxopt's numeric error tolerance. Such a principled choice of support vectors would be part of what packages like LIBSVM provide -- is that correct?
Numerical optimization is a difficult problem, so it is hard to define a single principled way. Specific packages like LIBSVM use careful implementations to stably mark the at-bound alphas; in general packages for convex programming, this may not be the case.

Quote:
Originally Posted by hsolo View Post
4. I found that in Problems 2 and 3, some of the classifiers have a single-digit number of margin SVs while some run into the thousands. I am somewhat uncomfortable with this huge variation, but a visual perusal of the thousands of distinct evaluations of b indicated they are all close to each other; moreover, the same code generates numbers that are essentially consistent with the margin support vectors and E_{\rm in} discussed in the classification problem in the thread http://book.caltech.edu/bookforum/showthread.php?t=4044
So I am hoping I am right in assuming that the distinct values of b being close to each other are a good indicator of soundness. Any comments?
If the convex programming package is working reasonably well, the b values computed from the margin SVs should be close to each other, so that can act as a safety check.
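A small sketch of that safety check (hypothetical names; alpha as returned by the rounding step above, K the Gram matrix):

Code:
import numpy as np

def b_estimates(alpha, y, K, C):
    """One b estimate per margin SV: b_i = y_i - sum_j alpha_j y_j K(x_j, x_i)."""
    margin = (alpha > 0) & (alpha < C)            # free SVs only
    return y[margin] - (alpha * y) @ K[:, margin]

# bs = b_estimates(alpha, y, K, C)
# bs.max() - bs.min() should be tiny (e.g. below the b0 tolerance) if all is well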

Quote:
Originally Posted by hsolo View Post
5. Finally, in the context of the discussion in Lecture 17 on Data Snooping -- we would have to use 45 SVMs (one versus one) or 10 SVMs (one versus rest) for our 10-way classification problem. The number of margin SVs ranges from single digits to thousands. In this context, how should we view the theoretical generalization bounds? They would seem to fail the rule of thumb of a ratio of 10. But if E_{\rm out} on actual test data is low, can we still go ahead with this?
For the multiclass case, there are theoretical results that analyze the performance of one-versus-one and one-versus-rest. The results are not as simple as counting the number of internal models, and it is often difficult to decide theoretically which one is the better choice. In practice, people often favor one-versus-one SVMs for computational efficiency during training, because each SVM is trained with significantly less data than the whole set.

Hope this helps.
__________________
When one teaches, two learn.
#9
07-31-2013, 02:37 AM
hsolo
Member

Join Date: Jul 2013
Posts: 12
Re: How many support vectors is too many?

Many thanks for your reply, Prof. Lin.

#10
08-01-2013, 07:20 PM
hsolo
Member

Join Date: Jul 2013
Posts: 12
Re: How many support vectors is too many?

Quote:
Originally Posted by htlin View Post

Numerical optimization is a difficult problem, so it is hard to define a single principled way. Specific packages like LIBSVM use careful implementations to stably mark the at-bound alphas; in general packages for convex programming, this may not be the case.
The SVM tutorial by Burges indicates that to get the threshold b in practice you just average the b's estimated from the margin support vectors. This is what seems to be done in the Matlab code in http://users.ecs.soton.ac.uk/srg/pub...ns/pdf/SVM.pdf as well.

I found that even in some of the HW cases (1 versus 5 classification, Q=5) I didn't end up with a set of thresholds that agree with each other -- I was wondering whether that's unexpected or indicative of a bug in my implementation. On the other hand, had I used the above averaging approach (with some heuristic a0 and b0 to decide the margin SVs), I probably wouldn't have noticed the discrepancy in b values in the first place. Is there a way to get around this?
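Building on the per-SV estimates, the Burges-style averaging looks roughly like this (a sketch with hypothetical names, mirroring the earlier safety-check snippet):

Code:
import numpy as np

def threshold_b(alpha, y, K, C):
    """Burges-style threshold: average b over the margin SVs."""
    margin = (alpha > 0) & (alpha < C)
    bs = y[margin] - (alpha * y) @ K[:, margin]  # one estimate per margin SV
    return bs.mean()                             # averaging damps numerical noise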