Re: Machine Learning and census models
My opinion is: without enough data, it would be difficult for machine learning/data mining, or even human intelligence, to reach any conclusive statement about the underlying process. Hope this helps.
Re: Machine Learning and census models
How about this approach:
1. Don't look at the data! If you have looked at the data, find a machine learning expert who has not looked at it and ask him to do it for you. [Unless you have some method of forgetting what you have seen, that is.]
2. Pick a learning method suited to the size of the data set and use leave-one-out cross-validation to find the optimal hypothesis. The interesting question is which learning method to use in this second part: something pretty general and regularized (see the sketch after this list).
3. Bear in mind that extrapolation of a non-stationary process is not necessarily possible (the cross-validation has an easy time of it, because most of the data points are internal).
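A minimal sketch of point 2 (my own illustration, not code from this thread): it assumes ridge-regularized polynomials as the "pretty general and regularized" method, and uses a synthetic 14-point series as a stand-in for the real census figures.

```python
import numpy as np

# Synthetic stand-in for the 14 census points (NOT real data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 14)                      # scaled census dates
y = 1.0 / (1.0 + np.exp(-8.0 * (t - 0.5))) + 0.02 * rng.standard_normal(14)

def poly_features(t, degree):
    """Columns 1, t, ..., t^degree."""
    return np.vander(t, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def loo_error(t, y, degree, lam):
    """Mean squared leave-one-out error for one (degree, lambda) choice."""
    errs = []
    for i in range(len(t)):
        keep = np.arange(len(t)) != i              # leave point i out
        w = ridge_fit(poly_features(t[keep], degree), y[keep], lam)
        pred = poly_features(t[i:i + 1], degree) @ w
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errs))

# Grid over a small, regularized hypothesis set; keep the lowest LOO error.
best = min(((d, lam) for d in range(1, 6) for lam in (1e-4, 1e-3, 1e-2, 1e-1)),
           key=lambda p: loo_error(t, y, *p))
print("chosen (degree, lambda):", best)
```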
Re: Machine Learning and census models
On reflection, I suspect that if the aim is extrapolation into the future, a more principled alternative to function approximation with leave-one-out cross-validation might be a variation where the in-sample data for each run consists only of the data preceding the out-of-sample data point. (The reason is that fitting intermediate points is easier, so leave-one-out validation could easily lead to overfitting when the real task is extrapolation.)
Does anyone have any views on this rather important general situation?
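A minimal sketch of that variation, under the same assumptions as the earlier snippet (ridge-regularized polynomials and synthetic stand-in data, chosen only to make the idea runnable): each point is validated using a model trained on its past only.

```python
import numpy as np

# Synthetic stand-in series (NOT real census data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 14)
y = 1.0 / (1.0 + np.exp(-8.0 * (t - 0.5))) + 0.02 * rng.standard_normal(14)

def poly_features(t, degree):
    return np.vander(t, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def forward_error(t, y, degree, lam, min_train=5):
    """Average squared error when each point is predicted from its past only."""
    errs = []
    for i in range(min_train, len(t)):
        w = ridge_fit(poly_features(t[:i], degree), y[:i], lam)   # past data only
        pred = poly_features(t[i:i + 1], degree) @ w              # next point
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errs))

best = min(((d, lam) for d in range(1, 4) for lam in (1e-3, 1e-2, 1e-1)),
           key=lambda p: forward_error(t, y, *p))
print("chosen (degree, lambda) under forward-only validation:", best)
```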
Re: Machine Learning and census models
However, with much more data than the 14 points described here, methods of the type I have researched in the past use two levels of design (similar to those discussed in your lectures and examples about cross-validation) to select the class of model (but generally for hyperparameters, rather than the choice of the order of a polynomial approximation). The input data may be many input points with corresponding outputs, and the aim is still to be able to predict the output for some new set of inputs.

One approach looks at the several thousand data points as a sample from a distribution and applies cross-validation, followed by training with all of the data, to generate a hypothesis for that distribution. This seems fine, but it has an implicit assumption of stationarity. When intermediate points are being used to validate, future points are being used to generate the model. But the behaviour of the system in the future may depend on how it has behaved at intermediate points (this is how people make their decisions about what to do, and what their programs are using as data). Hence there is a subtle cheat going on here related to non-stationarity. Whether it is harmful or not, I am not yet sure.

This is how I arrived at the alternative approach of only using validation data that is in the future of the data used to select hyperparameters, as well as of the data used to generate the final model. This is something rather like cross-validation, but different (as the data used in training is restricted). The choice of the training and validation window sizes is interesting: in both cases there is a compromise between wanting lots of points and not letting the window extend over too much time (due to non-stationarity). It has some similarity to a method used by technical traders called walk-forward optimisation, which uses future out-of-sample errors to validate a method.

This modified approach is not necessarily completely immune to non-stationarity. It relies on the models being used being sophisticated enough to capture the changing behaviour of the system as a whole, so it may fail if this stops being true at some point. I am not aware of much published work on the way in which non-stationarity plays a role in this scenario.

One thing that strikes me is the way that the selection of hyperparameters through cross-validation (or the variant above) and the generation of a predictive hypothesis break the learning process up into two separate parts in quite a surprising way (even though a very neat and effective one). I wonder whether there are any alternative ways to organise the overall learning process that might give good results?
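The two-window scheme described above can be written down concretely. The sketch below is my own illustration, not Elroch's code: the helper names, window lengths and the ridge model are assumptions chosen only to make it runnable. Each hyperparameter candidate is scored only on validation blocks that lie strictly after its training window, and the winner would then be refit on recent data.

```python
import numpy as np

def walk_forward_splits(n, train_len, val_len, step):
    """Yield (train_idx, val_idx) with the validation block strictly after training."""
    start = 0
    while start + train_len + val_len <= n:
        train_idx = np.arange(start, start + train_len)
        val_idx = np.arange(start + train_len, start + train_len + val_len)
        yield train_idx, val_idx
        start += step

def select_hyperparameter(x, y, candidates, fit, predict,
                          train_len=200, val_len=50, step=50):
    """Return the candidate with the lowest average future (out-of-sample) error."""
    scores = {}
    for c in candidates:
        errs = []
        for tr, va in walk_forward_splits(len(x), train_len, val_len, step):
            model = fit(x[tr], y[tr], c)
            errs.append(np.mean((predict(model, x[va]) - y[va]) ** 2))
        scores[c] = float(np.mean(errs))
    return min(scores, key=scores.get), scores

# Example with a ridge-regularized linear model on synthetic data (NOT real data).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 1000)[:, None]
y = np.sin(x[:, 0]) + 0.1 * x[:, 0] + 0.2 * rng.standard_normal(1000)

def fit(X, y, lam):
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add bias column
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def predict(w, X):
    return np.hstack([np.ones((len(X), 1)), X]) @ w

best_lam, scores = select_hyperparameter(x, y, (1e-3, 1e-1, 10.0), fit, predict)
print("lambda chosen by walk-forward validation:", best_lam)
```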
Re: Machine Learning and census models
Thank you, Professor Lin, Professor Yaser and Elroch.
Upon reading Elroch's response where he talks about stationarity, it occurs to me that this situation (population growth models) is perhaps non-stationary by nature: the way population grows is not independent of time. Also, I don't think that in this context the issue is to predict a future population level based on T past observations, because the future level would depend on the time variable. Of course, I'm saying all this intuitively and, at present, I'm not able to pin it down mathematically.

This "intuition" is based on some ideas I've picked up about demographic transition theory. To summarize: a profound economic, technological or ecological development upsets the social structure of a population and its reproductive behavior, having the effect of raising (or lowering) the carrying capacity parameter (K). The population level then transitions to this point of stability (K) in a smooth fashion, as described by Adolphe Quetelet and Pierre Verhulst about 200 years ago. The newer population models are just "tweaks" elaborating on this idea: modifying the point in time where the population growth curve reaches its inflection point, and so on.

With the advent of the industrial revolution and modernization in almost all countries, population exploded due to better sanitary conditions and nutrition. But then the cost of having children grew (because children require more years of school, providing for clothes and food, and are basically unproductive for the first 40 :p years of their lives). Consequently, population growth eventually slows down and reaches a stable point. My belief - and from what I gather this was implicit in Quetelet's ideas, Positivist thinker that he was ;) - is that once a major technological/economic/natural event sets a growth mechanism in motion, population levels change according to that "law", until a new event upsets this "law".

In the case of Venezuela, there has been more than one inflection point in the population curve, indicating that perhaps there have been several transcendental events affecting population growth. In fact, during the first two decades of the 20th century, there were several famines, malaria epidemics and adverse economic conditions (low prices of coffee, which was the country's staple export product). You can observe that at this time, population growth slowed down to almost zero. But then, during the 1930s, with oil exploitation, population growth picked up speed. Currently, it seems to be slowing down and the population is aging. According to my models, growth will stop by 2030. Has there been a new transcendental event to change these dynamics (like, for example, the chavista "socialist" revolution that has been going on for 15 years now)? Impossible to say.

From what you all have posted, I would think that if anything can be done at all by way of validation, it would have to be as Professor Yaser wrote: sliding a window of T observations, training on the first T-1 observations and validating on the last one. This I gather from Elroch's comment that cross-validation is always better when interpolating. How big would this window have to be? Big enough to barely have as many samples as parameters for a model? Big enough to cover the inflection points in the first part of the 20th century? If so, wouldn't that be like data snooping? I can post the census figures and some R code for estimating the models if you guys are interested.
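For reference, the classical Verhulst logistic model referred to above can be written as follows (standard textbook notation, not taken from the thread: r is the intrinsic growth rate, K the carrying capacity, P_0 the initial population):

\[
\frac{dP}{dt} = r\,P\left(1 - \frac{P}{K}\right),
\qquad
P(t) = \frac{K}{1 + \dfrac{K - P_0}{P_0}\, e^{-r t}},
\]

with the inflection point of the S-shaped curve at P = K/2, where growth is fastest. The "tweaks" in the newer models mentioned above amount to shifting where and how sharply this inflection occurs.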
Re: Machine Learning and census models
I think it is hard to do without more data, because it is very difficult for machine or human intelligence to recover the process behind this data. Maybe the replies above will help you. :bow:
Re: Machine Learning and census models
Thanks for this.