Before I discuss how the models did in the last storm, I'd like to put up a brief discussion of model averaging, and why I think it works. Although the example I give is overly simplistic and relies on a lot of assumptions, I hope it gets some of the main ideas across.
Let's imagine a situation in which we're interested in the track of a storm. For simplicity, say we're only concerned with whether it's going north or south. The Euro comes in and shows us where it thinks the storm is going.
So what does that tell us? The Euro is just an estimate of the true track of the storm, and there will be some error in its forecast. The storm might go more north than the Euro thinks, or it might go more south. The true track is more likely to be close to the Euro's track than it is to be really far away. If we plotted the historical distribution of actual storm tracks relative to the Euro forecast tracks, we'd get some probability distribution around the Euro's forecast track. It would look something like this.
This tells us the relative probabilities of the actual storm tracks given the Euro's forecast. (In the above example I've assumed the distribution is a Normal distribution, a.k.a. a "bell curve". There are some good reasons to use this distribution, but it's an assumption I'll revisit below.)
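To make this concrete, here's a minimal sketch in Python, with all numbers made up for illustration: treat the storm's true position along our north-south line as a random variable centered on the Euro's forecast track.

```python
from scipy.stats import norm

# Made-up numbers: positions measured in km north of an arbitrary reference.
euro_track = 0.0    # the Euro's forecast track position
euro_sigma = 150.0  # assumed std. dev. of historical Euro track errors

# Distribution of the storm's actual position, given the Euro forecast.
euro_dist = norm(loc=euro_track, scale=euro_sigma)

# e.g. relative likelihood the storm ends up 100 km north of the Euro's track:
print(euro_dist.pdf(100.0))  # ~0.0021
```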
OK, now imagine that the GFS comes in with a forecast that is well north of the Euro. There will also be a probability distribution of possible storm tracks given the GFS forecast.
Notice how I've made the distribution wider for GFS tracks than for Euro tracks. That's because the GFS has less skill than the Euro, so the average error in forecast tracks will be larger for the GFS.
So now what is the probability distribution for storm tracks considering both the GFS forecast and the Euro forecast? Assuming the errors in the forecasts are independent (i.e. they're not correlated), each forecast contributes an independent likelihood for the true track, and independent likelihoods multiply. So we can just multiply the two distributions above, then re-scale the resulting distribution to make sure all probabilities add up to 1.
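Here's what that multiply-and-rescale step looks like in the running sketch. The GFS numbers are also made up: I've put its track 300 km north of the Euro's and given it a wider error distribution.

```python
import numpy as np
from scipy.stats import norm

euro_dist = norm(loc=0.0, scale=150.0)   # Euro: reference position, smaller error
gfs_dist = norm(loc=300.0, scale=225.0)  # GFS: 300 km farther north, wider error

# Multiply the two likelihoods pointwise on a grid of candidate tracks...
x = np.linspace(-600.0, 900.0, 3001)
combined = euro_dist.pdf(x) * gfs_dist.pdf(x)
# ...then re-scale so the result integrates to 1.
combined /= combined.sum() * (x[1] - x[0])

print(x[np.argmax(combined)])  # peak falls between the two forecasts, ~92 km north
```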
If we assume our initial probability distributions were normal, then this resulting probability distribution (the grey curve above) is also normal. The grey curve has a couple of interesting properties (both demonstrated in the sketch after this list):
1. The location of the peak of the grey curve is a weighted average of the locations of the peaks of the blue and orange curves. The weights are given by the inverses of the squares of the widths of the blue and orange curves. So, for example, if the orange curve is twice as wide as the blue curve, it should be given 1/4 the weight of the blue curve when calculating the average.
2. The grey curve is narrower than either the blue curve or the orange curve. We actually have more confidence in the average forecast track than we do in any single op run.
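Both properties fall out of the closed form for the product of two normal densities. A sketch, using the same made-up numbers as above:

```python
def combine(mu1, sigma1, mu2, sigma2):
    """Peak and width of the (re-scaled) product of two normal densities."""
    w1, w2 = 1.0 / sigma1**2, 1.0 / sigma2**2  # weights = inverse squared widths
    mu = (w1 * mu1 + w2 * mu2) / (w1 + w2)     # property 1: weighted-average peak
    sigma = (w1 + w2) ** -0.5                  # property 2: narrower than either input
    return mu, sigma

# Euro at 0 km (sigma 150), GFS 300 km north (sigma 225):
print(combine(0.0, 150.0, 300.0, 225.0))  # peak ~92 km north, sigma ~125 km
```

Note the weights here: the GFS curve is 1.5 times as wide as the Euro's, so it gets (1/1.5)^2 = 4/9 the weight, and the combined peak lands about 92 km north, i.e. noticeably closer to the Euro.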
Now let's say that we get closer to game time, and the forecast tracks of the Euro and GFS are just as far apart as before. But since we're at a shorter lead time (say a 72-hour forecast instead of a 120-hour forecast), the expected errors in both the Euro and GFS forecasts are smaller. The spread in likely storm tracks given these forecasts will be smaller as well.
Now the probability of either the Euro or GFS being "right" gets smaller, and we're even more likely to end up with something in the middle.
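Reusing combine() from the sketch above, with the errors scaled down for the shorter lead time (still made-up numbers): the combined peak stays at the same weighted-average position, but the distribution tightens around it, leaving each individual forecast track further out in the tails.

```python
# 120-hour-style errors vs. 72-hour-style errors; separation unchanged at 300 km.
print(combine(0.0, 150.0, 300.0, 225.0))  # peak ~92 km north, sigma ~125 km
print(combine(0.0, 90.0, 300.0, 135.0))   # peak ~92 km north, sigma ~75 km
```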
There are a few important assumptions to consider in this analysis:
1. The assumption that the forecast errors are normally distributed. There are some statistical arguments to support this assumption, and based on some quick internet searches, normal distributions appear to be widely used historically. However, I didn't find much empirical evidence supporting this assumption. It is almost certainly more valid at short lead times than long ones, and for some forecast variables more than others. If the distribution is not normal, that could change some details of the above analysis, but we would still expect "average" tracks in between the two forecasts to be the most likely.
2. I have assumed that there is no bias in the model output. For example, if one model consistently forecasts storm tracks that are on average too far south, the model output should be adjusted north before including it in any average.
3. I have assumed the errors are not correlated. This is actually a pretty big assumption, as model errors almost certainly are correlated. It is best to use models that are not highly correlated. When model errors are correlated, you should further decrease the weight you give to the less-skilled model (there's a sketch of this after the list). Examples of models that (I believe) are highly correlated include any op and its ensemble control run, the GGEM and RGEM, and any model and the para-version (i.e. the next incremental improvement) of the same model. Models that use different physics will probably have less correlation and be more useful. I suspect the ICON, which is non-hydrostatic, has relatively low correlation with the rest of the globals, which makes it more useful in model averaging than it otherwise would be.
4. I have assumed that we can simplify the model output to a single dimension, e.g. north vs. south. Reality is more complicated.
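To put a rough number on assumption 3: for two correlated, unbiased estimates there is a standard minimum-variance weighting, and it shows the weight on the less-skilled model shrinking as the error correlation rho grows. A sketch, again with my made-up error sizes:

```python
def correlated_weights(sigma1, sigma2, rho):
    """Minimum-variance weights for averaging two correlated, unbiased estimates."""
    cov = rho * sigma1 * sigma2
    w1 = (sigma2**2 - cov) / (sigma1**2 + sigma2**2 - 2.0 * cov)
    return w1, 1.0 - w1

for rho in (0.0, 0.3, 0.6):
    print(rho, correlated_weights(150.0, 225.0, rho))
# rho = 0.0 -> weights (0.69, 0.31): the independent, inverse-variance case above
# rho = 0.6 -> weights (0.93, 0.07): the wider model barely counts
```

That brings us to the 03/12 system, which I'll discuss in the next post.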