I believe I recall reading that the range of functions which can be approximated to any given accuracy by multi-layer networks is the same as the range achievable by networks with just 2 hidden layers. Networks with one hidden layer, however, are limited to approximating a more restricted (but still rather general) class of functions, which (on checking) I find consists of the continuous functions on compact subsets of $\mathbb{R}^n$.

Of course, this doesn't preclude networks with more hidden layers being better in some other definable sense. [Thinking of biological neural networks such as those in our brains, it seems natural for these to be very deep, with multiple levels of processing feeding into each other.]
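To illustrate the one-hidden-layer case, here is a minimal sketch (the architecture, learning rate, and target function are all my own illustrative choices, not drawn from any particular source) that fits a single tanh hidden layer to a continuous function on a compact interval using full-batch gradient descent:

```python
import numpy as np

def fit_one_hidden_layer(n_hidden=30, lr=0.05, steps=20000, seed=0):
    """Fit y = sin(x) on the compact interval [0, 2*pi] with a single
    tanh hidden layer, trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    X = np.linspace(0.0, 2.0 * np.pi, 200).reshape(-1, 1)
    y = np.sin(X)

    # One hidden layer: input -> tanh units -> linear output.
    W1 = rng.normal(0.0, 1.0, (1, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
    b2 = np.zeros(1)

    n = len(X)
    for _ in range(steps):
        hidden = np.tanh(X @ W1 + b1)
        err = hidden @ W2 + b2 - y
        # Backpropagate the squared-error gradient through both layers.
        gW2, gb2 = hidden.T @ err / n, err.mean(axis=0)
        ghid = (err @ W2.T) * (1.0 - hidden ** 2)   # tanh' = 1 - tanh^2
        gW1, gb1 = X.T @ ghid / n, ghid.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    return float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))

print(f"final MSE: {fit_one_hidden_layer():.4f}")
```

In practice one would use a proper framework, but this shows a one-hidden-layer network approximating a continuous function on a compact domain, as the theorem describes.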

Regarding the design of neural networks, I've experimented with them on several occasions over several years and have applied rules of thumb. Besides generally limiting myself to 2 hidden layers, one idea concerns how much data is needed to justify a given network complexity. While it is normal to use a validation set to stop training when overfitting begins, I suspect there is no advantage to having many neurons if stopping occurs too early for them to be put to good use.
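The validation-based early stopping mentioned above can be sketched as follows (a hypothetical linear model stands in for the network, and the data and the `patience` parameter are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_with_early_stopping(patience=20, max_epochs=2000):
    """Stop training when validation loss hasn't improved for `patience`
    epochs, and keep the weights from the best validation epoch."""
    # Noisy linear data, split into train/validation sets.
    X = rng.normal(size=(120, 5))
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ true_w + rng.normal(scale=0.5, size=120)
    Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]

    w = np.zeros(5)
    best_w, best_val, since_best = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        # One full-batch gradient step on the training loss.
        grad = Xtr.T @ (Xtr @ w - ytr) / len(ytr)
        w -= 0.05 * grad
        val = float(np.mean((Xva @ w - yva) ** 2))
        if val < best_val:
            best_val, best_w, since_best = val, w.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:  # no improvement for `patience` epochs
                break
    return best_w, best_val, epoch

w, val_mse, stopped_at = train_with_early_stopping()
print(f"stopped after epoch {stopped_at}, best validation MSE {val_mse:.3f}")
```

If stopping tends to trigger long before the extra neurons (here, extra parameters) have been exploited, that is a sign the model is larger than the data justifies.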

One practical formula relates the amount of training data needed to the size of the network and to the tolerable error.

Sorry I can't locate where this came from: perhaps someone else knows?

There is also theoretical work on estimating the VC dimension of neural networks; see, for example, "Vapnik-Chervonenkis Dimension of Neural Nets" by Peter L. Bartlett.