A post at another forum:

If you scale the inputs $\mathbf{x}_n$ by a factor $\alpha$, then the linear regression solution $\mathbf{w}_{\text{lin}}$ scales in the opposite direction (other things being equal), since it is trying to make $\mathbf{w}^{\mathsf T}\mathbf{x}_n$ match the same value ($+1$ or $-1$). Now if you take the LR solution $\mathbf{w}_{\text{lin}}$ and use it as the initial condition for PLA, the impact of each PLA iteration scales up with $\alpha$, since you are adding $y_n\mathbf{x}_n$ to the weight vector at each iteration.

Put these together and you conclude that, as $\alpha$ scales up and down, the impact of the LR solution vector on PLA goes down and up, respectively, and significantly so. At the large-$\alpha$ extreme, the LR solution $\mathbf{w}_{\text{lin}}$ behaves like the zero vector $\mathbf{0}$, so you get the original PLA iterations. As $\alpha$ gets smaller, $\mathbf{w}_{\text{lin}}$ kicks in as a good initial condition (with non-trivial size) and you gain some PLA iterations. As $\alpha$ diminishes further, PLA will take longer to correct the misclassified points that $\mathbf{w}_{\text{lin}}$ didn't get, simply because each PLA update $y_n\mathbf{x}_n$ becomes relatively small in the movement that it creates.
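The interplay above can be sketched numerically. The following is a minimal sketch, assuming NumPy, a hypothetical synthetic data set (the target `w_true` and the scale factors are illustrative, not from the post): it solves linear regression on the ±1 targets via the pseudo-inverse, uses that solution to initialize PLA, and counts the PLA updates needed at different input scales. Note the exact identity behind the first point of the argument: scaling the inputs by $\alpha$ scales $\mathbf{w}_{\text{lin}}$ by $1/\alpha$.

```python
import numpy as np

def pla(X, y, w0, max_iters=100_000):
    """PLA: repeatedly add y_n * x_n for a misclassified point; return (w, #updates)."""
    w = w0.astype(float).copy()
    for t in range(max_iters):
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:
            return w, t          # converged: no misclassified points left
        n = mis[0]               # deterministic pick, for reproducibility
        w += y[n] * X[n]
    return w, max_iters

# Hypothetical linearly separable data; points near the boundary are dropped
# so a clear margin guarantees PLA converges. First column is the bias coordinate.
rng = np.random.default_rng(0)
w_true = np.array([0.1, 1.0, -1.0])
X = np.hstack([np.ones((500, 1)), rng.uniform(-1, 1, (500, 2))])
X = X[np.abs(X @ w_true) > 0.3]          # keep only points with a clear margin
y = np.sign(X @ w_true)

def lr_then_pla(alpha):
    """Scale the inputs by alpha, solve LR on the +-1 targets, then run PLA from w_lin."""
    Xs = alpha * X
    w_lin = np.linalg.pinv(Xs) @ y       # least-squares solution; scales as 1/alpha
    return pla(Xs, y, w_lin)

for alpha in (0.1, 1.0, 10.0):
    w, t = lr_then_pla(alpha)
    print(f"alpha={alpha:>4}: PLA updates after LR init = {t}")
```

Since each PLA update adds $\alpha\,y_n\mathbf{x}_n$ while $\mathbf{w}_{\text{lin}}$ shrinks as $1/\alpha$, the relative impact of one update on the initial vector grows like $\alpha^2$, which is why the two extremes in the post behave so differently.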