LFD Book Forum


Michael Reach 05-27-2013 02:28 PM

Data snooping and science
 
I was very struck by the idea mentioned in Lecture 17 (and in the book) that you can be guilty of data snooping because you are using information about the data that your colleagues gained earlier. It was especially interesting because I just watched a video by a famous physicist in which (I think) he violates this rule. In case you're interested:
http://www.youtube.com/watch?v=6ledD81ofy0&t=14m20s
(the part I wanted starts at around 14:30)
Dr. Muller (head of the BEST temperature project) points to world temperature anomalies over the last century or two. He says that what should really impress skeptics is that he can model the anomaly with just one parameter.
He can't, right? There is one set of temperature anomalies, and there are literally dozens of models representing the data. Zillions of parameters and ways of representing the data have already been tried, and I don't see how to ignore all of that at this late date.

I'm really kind of confused (I know little about climate science - I'm asking about the Machine Learning side of this). Say I wanted to become an expert and analyze climate models, and figure out the sensitivity to CO2 or the like. How would one begin to give an estimate of the generalization of one's modelling? You have only one data set: only so much data over the last century or so. It really isn't that much data, and there sure are a lot of variables to tweak and potential hypotheses. And then you test it: each year you get a few more data points to check. Coming at it from the standpoint of this course, it would seem almost hopeless to me. Is that correct?
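
To make that worry concrete, here is a minimal back-of-the-envelope sketch using the finite-hypothesis Hoeffding bound from the course. The numbers are purely illustrative assumptions (about 130 annual observations, a binary-error setting, and guesses at how many hypotheses have effectively been tried); it is not a real climate analysis, just a way to see how quickly the error bar grows with the number of hypotheses explored.

Code:

import numpy as np

# Back-of-the-envelope Hoeffding/union bound (illustrative numbers only).
# With probability >= 1 - delta, |E_in - E_out| <= sqrt(ln(2M / delta) / (2N))
# for M hypotheses and N data points, in a binary-error setting.

def hoeffding_error_bar(N, M, delta=0.05):
    return np.sqrt(np.log(2 * M / delta) / (2 * N))

N = 130  # roughly one annual temperature-anomaly observation per year since ~1880
for M in (1, 10, 1000, 10**6):  # guesses at how many hypotheses have been tried
    print(f"M = {M:>7}: error bar ~ {hoeffding_error_bar(N, M):.2f}")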

yaser 05-27-2013 04:41 PM

Re: Data snooping and science
 
Interesting situation. In this area in particular, the matter is further complicated by the desire of one group to reach one conclusion and of another group to reach the opposite conclusion. The possibility of data snooping creates a suspicion about whether the conclusions are genuine, and contaminates the data for further investigators even if the data would reasonably lead to one conclusion or the other. A remedy, which is to test the different hypotheses out of sample on new data, is tricky in this case, as the time constants involved may make that impractical.

Elroch 05-31-2013 03:34 AM

Re: Data snooping and science
 
This issue of the relationship between machine learning and the scientific method is a fascinating one. To me, machine learning abstracts the scientific method and can be considered to include its purest (and perhaps only epistemologically valid) form. We forget how good we are naturally at building extremely high-level models of the world, which often allows scientists (and all of us) to shortcut a mechanical implementation of the learning cycle; but that cycle is implicit in all scientific knowledge (and in a lot of other conscious and subconscious understanding of the world).

I believe the point in the lecture is this: there is no uncertainty about the laws of physics. Climate change is a direct consequence of physics, without any additional assumptions. The only difficulties in applying the laws of physics to global warming predictions are:
(1) the computations are hard, and can only be approximated. This is the reason the most powerful computers ever built are used to do them, and why the differences between predictions and the error bars on them have declined over time.
(2) the past data is not as detailed as we would wish (and, to a lesser extent, not quite as precise as we would wish)
(3) there is uncertainty about some of the other data for the future (such as the minor effect of small fluctuations in the solar constant, and other anthropogenic effects like SO_2 emissions).

I think it is the complete certainty about the physics that allows the strong statement made by Dr. Muller, where I presume he was referring to the effect of variation in CO_2 emissions on temperatures, a one-parameter relationship. This effect can be isolated even if one only has an approximation to the boundary conditions and to the fluctuations in the solar constant: you can simply do different runs, with everything the same except for different CO_2 emission profiles over time. As Michael points out, there is acknowledged uncertainty in the extrapolation of the one-parameter relationship, because of the details of positive feedback mechanisms (e.g. methane emissions from melting permafrost, lowered albedo of polar regions) and negative feedback mechanisms (e.g. from increased cloud cover).
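
Just to illustrate what "different runs, identical except for the CO_2 profile" means, here is a minimal sketch with a toy zero-dimensional energy-balance model. The heat capacity, feedback parameter, and emission paths are invented round numbers; a real GCM is nothing like this simple, but the isolation-by-comparison idea is the same.

Code:

import numpy as np

# Toy zero-dimensional energy-balance model (illustrative only; the parameter
# values are rough round numbers, not taken from BEST or from any real GCM).
C_HEAT = 8.0    # effective heat capacity, W yr m^-2 K^-1 (assumed)
LAMBDA = 1.2    # climate feedback parameter, W m^-2 K^-1 (assumed)

def co2_forcing(co2_ppm, co2_ref=280.0):
    # standard logarithmic approximation for CO2 radiative forcing, in W/m^2
    return 5.35 * np.log(co2_ppm / co2_ref)

def run(co2_path, dt=1.0):
    # integrate C dT/dt = F(t) - lambda * T with forward Euler, one step per year
    T, out = 0.0, []
    for co2 in co2_path:
        T += dt * (co2_forcing(co2) - LAMBDA * T) / C_HEAT
        out.append(T)
    return np.array(out)

years = np.arange(2000, 2101)
flat   = np.full(len(years), 400.0)         # CO2 held at 400 ppm
rising = 400.0 * 1.005 ** (years - 2000)    # CO2 growing 0.5% per year

# identical model, identical everything else, only the CO2 profile differs
print("warming by 2100, flat CO2:   %.2f K" % run(flat)[-1])
print("warming by 2100, rising CO2: %.2f K" % run(rising)[-1])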

Here is an early example of checking a prediction against what happened after it was made, safe from the slightest chance of snooping:
http://www.guardian.co.uk/environmen...global-warming

Michael Reach 06-02-2013 01:03 PM

Re: Data snooping and science
 
Elroch, I don't know if you read Nate Silver's book "The Signal and the Noise", which I enjoyed a lot. He spends a lot of time on overfitting. And he does point out there that the original global warming predictions have been much closer to the mark on the test data (the future) than the later model predictions. He suggests that overfitting is the reason.
Only thing is, a lot of people got really upset at him for it!

Elroch 06-02-2013 03:02 PM

Re: Data snooping and science
 
Quote:

Originally Posted by Michael Reach (Post 11014)
Elroch, I don't know if you read Nate Silver's book "The Signal and the Noise", which I enjoyed a lot. He spends a lot of time on overfitting. And he does point out there that the original global warming predictions have been much closer to the mark on the test data (the future) than the later model predictions. He suggests that overfitting is the reason.
Only thing is, a lot of people got really upset at him for it!

That is very interesting! I am surprised, as my understanding was that those models were entirely physical rather than heuristic.

One thing that annoys me is when denialists argue against global warming on the basis of short term data. There are people who simply cannot comprehend the idea of a trend with some sort of random or extraneous variations superimposed on it, despite the fact that they exist in almost every aspect of the world.

Michael Reach 06-02-2013 04:19 PM

Re: Data snooping and science
 
Elroch, I was pretty resolved to stay out of discussing the actual science, since, as I said, I don't know much about it. But I am having trouble following what you're saying.
First, I'm not sure what you mean by the "complete certainty" of the physics. Probably you mean that one aspect of the physics, the amount of heating that CO2 would cause other things being equal, is simple physics. But I don't know why you think that should support Dr. Muller's statement, since other things are not equal. The tough part of the job is going to be figuring out what role all the other factors play. In the end, as you mentioned, the size of the feedbacks is a critical issue, with estimated values ranging (as near as I can tell) over a factor of six or so, or maybe a factor of two from more recent work. And without the feedbacks, the basic sensitivity to CO2 would be much less concerning.
In any case, I don't think that was what Muller was saying. Obviously, if all he needs to do is estimate a single parameter, the fact that he can fit the data with one parameter isn't going to impress most skeptics very much! My impression from the video (and from some comments by his co-worker Steve Mosher elsewhere) is that they found that the best fit to the data was given by fitting CO2 and dropping all the other variables. Even volcanoes, which have an obvious immediate impact, dropped out if you look over a span of a few years, and he was left with no explanatory variables that helped except CO2.
Anyhow, if that's what he was saying, that's what I was asking: to what extent is he allowed to do that, and to what extent do we say that he's using a lot of hypothesis choices made by others?

"One thing that annoys me is when denialists argue against global warming on the basis of short term data." That's interesting: From a Bayesian point of view, the last decade of data shouldn't "argue against global warming", but it certainly must bring down the estimate of the climate sensitivity to CO2: that's pretty much automatic from Question 20 on the final! How much is going to be an important question. I believe there's a lot of discussion right now about a couple of papers currently submitted by Nic Lewis and some others, where he sharply lowers the sensitivity ranges based on the last decade of data - and others dispute his claims. He also has been complaining about the use of uniform priors in earlier IPCC estimates, so that part of the lecture is really very relevant!

Michael Reach 06-02-2013 04:29 PM

Re: Data snooping and science
 
Sorry, I hadn't noticed your earlier response. If I understood Silver right (it's been a few months), he suggested that there may have been enough random variation in the last decades of the twentieth century, say from 1970 to 2000 when they were building the models - and tweaking the parameters to get them right - to cause the models to run "hot". That was the training data, so to speak. Hansen's original predictions preceded the hot spell (that is, the rapid growth of the temperature anomaly) at the end of the century, so his estimates were naturally somewhat lower. Now that there hasn't been the same rapid growth for a decade or so, his estimates are looking better and better.
If Silver is right, that's a classic case of overfitting.
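
Here is a minimal synthetic sketch of that mechanism (not real temperature data - the trend, the noise level, and the "hot spell" are all invented). A flexible model tuned on a training window that happens to run hot fits that window beautifully and then extrapolates badly, while a simple one-parameter trend holds up much better out of sample.

Code:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy: a slow linear "true" trend plus noise, with a chance hot spell
# near the end of the training window. All numbers are invented for illustration.
years = np.arange(1970, 2014)
true_trend = 0.015 * (years - 1970)                    # assumed underlying trend, K
temps = true_trend + 0.12 * rng.standard_normal(len(years))
temps[25:31] += 0.15                                   # a "hot spell" around 1995-2000

train = years <= 2000                                  # the tuning period
test = ~train                                          # the years that came afterwards

def fit_and_test(degree):
    coeffs = np.polyfit(years[train] - 1970, temps[train], degree)
    pred = np.polyval(coeffs, years - 1970)
    def mse(mask):
        return np.mean((pred[mask] - temps[mask]) ** 2)
    return mse(train), mse(test)

for degree in (1, 8):                                  # simple trend vs. wiggly polynomial
    e_in, e_out = fit_and_test(degree)
    print(f"degree {degree}: training MSE = {e_in:.4f}, post-2000 MSE = {e_out:.4f}")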

Elroch 06-03-2013 05:01 AM

Re: Data snooping and science
 
Quote:

Originally Posted by Michael Reach (Post 11016)
Elroch, I was pretty resolved to stay out of discussing the actual science, since, as I said, I don't know much about it. But I am having trouble following what you're saying.
First, I'm not sure what you mean by the "complete certainty" of the physics. Probably you mean that one aspect of the physics, the amount of heating that CO2 would cause other things being equal, is simple physics. But I don't know why you think that should support Dr. Muller's statement, since other things are not equal. The tough part of the job is going to be figuring out what role all the other factors play. In the end, as you mentioned, the size of the feedbacks is a critical issue, with estimated values ranging (as near as I can tell) over a factor of six or so, or maybe a factor of two from more recent work. And without the feedbacks, the basic sensitivity to CO2 would be much less concerning.
In any case, I don't think that was what Muller was saying. Obviously, if all he needs to do is estimate a single parameter, the fact that he can fit the data with one parameter isn't going to impress most skeptics very much! My impression from the video (and from some comments by his co-worker Steve Mosher elsewhere) is that they found that the best fit to the data was given by fitting CO2 and dropping all the other variables. Even volcanoes, which have an obvious immediate impact, dropped out if you look over a span of a few years, and he was left with no explanatory variables that helped except CO2.
Anyhow, if that's what he was saying, that's what I was asking: to what extent is he allowed to do that, and to what extent do we say that he's using a lot of hypothesis choices made by others?

"One thing that annoys me is when denialists argue against global warming on the basis of short term data." That's interesting: From a Bayesian point of view, the last decade of data shouldn't "argue against global warming", but it certainly must bring down the estimate of the climate sensitivity to CO2: that's pretty much automatic from Question 20 on the final! How much is going to be an important question. I believe there's a lot of discussion right now about a couple of papers currently submitted by Nic Lewis and some others, where he sharply lowers the sensitivity ranges based on the last decade of data - and others dispute his claims. He also has been complaining about the use of uniform priors in earlier IPCC estimates, so that part of the lecture is really very relevant!

To begin with, it's all physics, and all the physics is known to more than sufficient precision. But there are three issues with applying this physics:

Firstly, the data we have about the state of the world's climate at any particular time is not perfectly detailed. But it has more detail than is needed for relatively simple outputs.

Secondly, the modelling of physical processes has to be approximate, out of necessity. But this modelling is, as far as I know, entirely physical, in the same way as weather forecasting models (which are vastly more detailed) are entirely physical. For an overview, see https://en.wikipedia.org/wiki/Global_climate_model

Thirdly, there is the uncertainty in the detail: you can't predict the weather in 6 months, however much computing power you have. But I don't think anyone claims that random fluctuations in the actual weather are going to affect the sorts of quantities (mostly averages) that climate change is about.

So Bayesian considerations are irrelevant: there is no scope for curve fitting in physical models; there is only scope for modelling the physical systems as accurately as your computer, data, and physical models permit. The only room for modification over time is to model the physics more accurately. The nearest thing to an exception might be the issue of flux correction, but that is a technique which stopped being necessary as supercomputers became more powerful.

[EDIT: I should clarify that this is not a field I have worked in (although I have done a lot of physical modelling and simulation). I can name-drop by pointing out that one of the other 9 mathematicians in my year in my college at Cambridge is one of the most senior climate scientists in the UK, Vicky Pope. (A few of her articles for the Guardian are listed at the end.)]

Michael Reach 06-04-2013 11:09 AM

Re: Data snooping and science
 
Elroch, I think you are mistaken. Certainly the Wikipedia article doesn't discuss the point. Here's a place that does:
"Tuning the climate of a global model"
http://onlinelibrary.wiley.com/doi/1...12MS000154/pdf
Note that the paper says this model didn't need much tuning (it still had some, though) because it was based on an earlier model that was carefully tuned to fit the 20th-century data. Consider this quote:
"Climate models ability to simulate the 20th century temperature increase with fidelity has become something of a show-stopper as a model unable to reproduce the 20th century would probably not see publication, and as such it has effectively lost its purpose as a model quality measure."

I think it is clear that none of the models are "purely physical" in the sense you mean. There are many models, and they make many choices, and they are constrained ("tuned") by the requirement that they must fit 20th century data. All those choices lead to different predictions and different sensitivities to CO2. I don't understand how one can claim that the choices are minor, when they lead to a range of several degrees C in their predictions (Figure 1 is very striking), as the paper points out. As such, overfitting is a potential issue, and Bayesian statistics should be usable to decide between them afterwards.
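
On the "decide between them afterwards" point, here is a minimal sketch of what that could look like: score each tuned model's predictions against observations that arrived after the tuning period, under an assumed Gaussian observation noise. The anomaly values and noise level below are invented placeholders, not real data.

Code:

import numpy as np

# Sketch of comparing already-tuned models on held-out (post-tuning) data.
# All numbers are invented placeholders; only the mechanics are the point.
obs_heldout = np.array([0.42, 0.45, 0.47, 0.46, 0.50])   # hypothetical anomalies, K
sigma_obs = 0.05                                          # assumed observation noise, K

model_predictions = {
    "model_A (runs hot)": np.array([0.55, 0.58, 0.62, 0.65, 0.69]),
    "model_B":            np.array([0.43, 0.44, 0.47, 0.48, 0.51]),
}

def log_likelihood(pred):
    # log of a product of independent Gaussians centred on the model's predictions
    return float(np.sum(-0.5 * ((obs_heldout - pred) / sigma_obs) ** 2
                        - np.log(sigma_obs * np.sqrt(2 * np.pi))))

for name, pred in model_predictions.items():
    print(f"{name}: held-out log-likelihood = {log_likelihood(pred):.1f}")
# With equal prior odds, the difference in log-likelihoods is the log Bayes factor.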

Elroch 06-04-2013 01:11 PM

Re: Data snooping and science
 
Quote:

Originally Posted by Michael Reach (Post 11025)
Elroch, I think you are mistaken. Certainly the Wikipedia article doesn't discuss the point. Here's a place that does:
"Tuning the climate of a global model"
http://onlinelibrary.wiley.com/doi/1...12MS000154/pdf
Note that the paper says this model didn't need much tuning (it still had some, though) because it was based on an earlier model that was carefully tuned to fit the 20th-century data. Consider this quote:
"Climate models ability to simulate the 20th century temperature increase with fidelity has become something of a show-stopper as a model unable to reproduce the 20th century would probably not see publication, and as such it has effectively lost its purpose as a model quality measure."

I think it is clear that none of the models are "purely physical" in the sense you mean. There are many models, and they make many choices, and they are constrained ("tuned") by the requirement that they must fit 20th century data. All those choices lead to different predictions and different sensitivities to CO2. I don't understand how one can claim that the choices are minor, when they lead to a range of several degrees C in their predictions (Figure 1 is very striking), as the paper points out. As such, overfitting is a potential issue, and Bayesian statistics should be usable to decide between them afterwards.

Actually, your interpretation is wrong in a crucial way, according to the climate scientists I have consulted. Tuning is not done using any information relating to trends in temperatures: it is only done using average conditions, in order to make models agree with empirical data that is completely independent of climate change (e.g. weather variations over a year). It is really a way of incorporating empirical knowledge of average short-term behaviour into models via parameters that cannot themselves be measured accurately.

So the tuning you refer to involves no curve fitting to temperature increase at all. Any increase in closeness of fit to historical data is an unsurprising consequence of models becoming increasingly complete and of higher spatial and temporal resolution.

Note that the bottom line of the paper you referred to is that even when they tuned all the parameters that could be tuned, with the aim of creating variation in the predictions, they found there was not as much variation in the predictions as they had thought possible. In consequence, the conclusion that damaging global warming is likely in this century under a range of scenarios is made more robust rather than less.

[If you consider this as a policy issue for the world, bear in mind that even a significant risk of disastrous consequences would justify quite extreme measures: this is a case of highly asymmetric risk. This is why there is a majority agreement that such measures are necessary, and it is potentially catastrophic that there is not unanimity.]

