LFD Book Forum

LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   The Final (http://book.caltech.edu/bookforum/forumdisplay.php?f=138)
-   -   Question 14 LLoyd's algorithm - empty clusters (http://book.caltech.edu/bookforum/showthread.php?t=662)

markweitzman 06-08-2012 10:10 AM

Question 14 LLoyd's algorithm - empty clusters
 
While checking my implementation of LLoyd's algorithm, I received an error of float division by zero on one of my runs (it seems to happen rarely). I realized that if on the initial or subsequent iteration one of the clusters has zero elements then the update for mu for that clustered needs to be altered (see slide 11/20 Draft Lecture 16 for update rule). There seem to be two possibilities - either use the old value of mu for the cluster (i.e. not update it) or seed a new random value. If you use the old value then it is possible that the cluster will never get any members (for example if your points have a relatively empty region in the square and the mu for the cluster is in that region. Effectively you are using one less cluster if that happens. The alternative is to use a new random seed for that cluster and hopefully get out of the empty region but I am worried about convergence issues.

I am curious if others have encountered this issue and how they think it should be dealt with.

kkkkk 06-08-2012 10:57 AM

Re: Question 14 LLoyd's algorithm - empty clusters
 
I generated a new random center when a center has no nearby points. The algo could still converge.

yaser 06-08-2012 11:45 AM

Re: Question 14 LLoyd's algorithm - empty clusters
 
Quote:

Originally Posted by kkkkk (Post 2858)
I generated a new random center when a center has no nearby points.

This is a valid approach. You can also restart from new initial centers.

mathprof 06-09-2012 05:09 PM

Re: Question 14 LLoyd's algorithm - empty clusters
 
I'm starting Lloyd's algorithm over anytime it comes up with fewer clusters than desired, so that each iteration I'm using the full number. Thanks to this original post I realized that this is an issue. Usually Lloyd's algorithm comes up with the correct number of clusters, but occasionally something weird happens. The larger the value of K (# desired clusters), the more frequently one ends up with less than the requisite number. I was very surprised to see rare cases where only one cluster remained (Perhaps I SHOULD be surprised, as all I've noticed is a bug in my code!). Anyone else see Lloyd's algorithm occasionally (rarely) coming up with only one region?

Thanks Mark, for asking the question. Also thanks to our illustrious professor!
:-)

markweitzman 06-09-2012 08:01 PM

Re: Question 14 LLoyd's algorithm - empty clusters
 
I haven't had any problem since I used the approach of updating the mean of a cluster to a random point if the cluster was empty. And to my surprise once I figured out the parameters the scilearn python package has a routine which implements lloyds algorithm and it agreed with my own coding once I set equivalent parameters and seeds.

dudefromdayton 06-11-2012 09:45 AM

Re: Question 14 LLoyd's algorithm - empty clusters
 
It's fuzzy to me right now, but it looks like when I had no points in a cluster, its center snapped to the origin! Oops. I'm not sure this happened enough to harm my score, though.

Yellin 06-11-2012 01:58 PM

Re: Question 14 LLoyd's algorithm - empty clusters
 
Quote:

Originally Posted by yaser (Post 2859)
This is a valid approach. You can also restart from new initial centers.

I tried
a) Leave points with 0 nearby events alone, and use them as cluster centers.
b) When there was a point with 0 nearby events, restart from new initial centers.
c) Do 40 random initializations of Lloyd's algorithm and pick the one with the lowest minimum. That always (10000 trials) turned out to have no unoccupied cluster centers.

For K=12 method c) was best at preventing regular RBF from being worse than SVM (problem 16), and for K=9 b) gave a somewhat higher probability of E_in =0 for regular RBF (problem 19). But the three methods were so close together in results that they all gave the same best choice among the multiple choices for problems 15-19.

TonySuarez 09-10-2012 02:36 PM

Re: Question 14 LLoyd's algorithm - empty clusters
 
Quote:

Originally Posted by dudefromdayton (Post 2981)
It's fuzzy to me right now, but it looks like when I had no points in a cluster, its center snapped to the origin! Oops. I'm not sure this happened enough to harm my score, though.

I implemented an "abort" counter for these cases, and subtracted its final value from the number of runs in the batch to calculate the statistics.

I already ran 20 or more times the simulation, where each one has 1000 runs, for K=9, K=12, various gamma, etc. I only saw 1 abort due to an "isolated" center. So, I think that this is not an issue in statistical terms. Hope I don't have bugs in the implementation...

Regarding your observation dudefromdayton, perhaps you reset your cluster centers to zero before Lloyd's algorithm, and somehow it (or they) stays zero if no X points are near it.:clueless:

yaser 09-10-2012 05:05 PM

Re: Question 14 LLoyd's algorithm - empty clusters
 
Quote:

Originally Posted by TonySuarez (Post 5095)
I implemented an "abort" counter for these cases, and subtracted its final value from the number of runs in the batch to calculate the statistics.

I already ran 20 or more times the simulation, where each one has 1000 runs, for K=9, K=12, various gamma, etc. I only saw 1 abort due to an "isolated" center. So, I think that this is not an issue in statistical terms. Hope I don't have bugs in the implementation...

Regarding your observation dudefromdayton, perhaps you reset your cluster centers to zero before Lloyd's algorithm, and somehow it (or they) stays zero if no X points are near it.:clueless:

This time, I fixed the problem statement to discard the runs that have empty clusters in order to standardize the answer.


All times are GMT -7. The time now is 01:12 PM.

Powered by vBulletin® Version 3.8.3
Copyright ©2000 - 2019, Jelsoft Enterprises Ltd.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.