Winter model performance discussion

cae · March 9, 2018

March 7, 2018.

Below is the stage IV precipitation analysis (verification data) for the event. The color scale is the same as used for the model runs.

Below are the 00z and 12z model runs up to the event. The Euro is top left, GFS is top right, GGEM is bottom left, and ICON is bottom right. This gif starts 84 hours before the last run. Before that there was a rain event, and weather.us doesn't have a way to distinguish between the precip totals.

To get a better sense of how the models did in predicting snow, I plot the snow depth maps below. Snow depth is not a great metric because:

1) Every model seems to calculate "snow depth" differently. (The Euro appears to use a more generous algorithm than the other models.)

2) There is something wrong with the snow depth maps for the GGEM on weather.us. It appears to ignore all depths below 2", which makes it look like the GGEM predicted no snow in areas where it predicted low snow depths.

Unfortunately, it's the only metric of snowfall available for all four models on weather.us.

The Euro is top left, GFS is top right, GGEM is bottom left, and ICON is bottom right. This gif starts 192 hours before the last run.

For comparison with the above plots, here are the reported snowfall totals from LWX.

cae · March 9, 2018

A few thoughts on model performance for 03/07/2018:

1. The Euro did very well in the long run. It consistently showed some sort of snow event for the northern tier from 10 days out. I hadn't noticed that until I put these maps together.

2. At about 6-7 days out, the GFS showed some good widespread hits, like this one:

They had support from the GEFS, but not from the other models or ensembles.

We didn't get a widespread hit, but it wasn't as bad as what the Euro showed at that time.

3. The GGEM lost the storm about 72 hours out, sending it out to sea.

It gradually came back over the next two runs. It was interesting that the extended RGEM still had the storm when the GGEM had lost it.

4. All of the globals struggled with the western edge of the precip. The Euro went from giving Lancaster, PA almost no precip to about 1.5" in four runs within 72 hours of the storm. They got about 0.3". After the 18z runs the day of the storm, the HRRR started picking up on an eastward shift in the precip shield that ended up verifying.

The 03/06 12z RGEM ensemble mean also overdid the western extent of the precip.

But a small fraction of members (about 10%) showed the risk of the shift.

I'm still trying to figure out how useful the RGEM ensemble is, and one of the things I've been wondering is whether we should pay attention to the spread in the members at game time. The above plot suggests the answer is yes.

cae · March 18, 2018

March 12, 2018.

Below is the stage IV precipitation analysis (verification data) for the event. The color scale is the same as used for the model runs.

Below are the 00z and 12z model runs up to the event. The Euro is top left, GFS is top right, GGEM is bottom left, and ICON is bottom right. This gif starts 96 hours before the last run. Before that there was another event, and weather.us doesn't have a way to distinguish between the precip totals.

We're missing some of the earlier ICON runs below because I didn't realize that weather.us only archives them for about 7 days.

I'm still looking for a consistent way to compare model output for snowfall. Instead of snow depth maps, below I show different snow total maps. They include 10:1 maps for the GFS (top right) and GGEM (bottom left), qpf as snow for the Euro (top left), and the snowfall map for the ICON (bottom right) using its own ratios. This gif starts 108 hours before the last run. Before that there was another event. (Some snow from that event can be seen in NE MD in the earliest panels.)

For comparison with the above plots, here are the reported snowfall totals from LWX. This map doesn't cover the entire state of Virginia, so I've also added a map of CoCoRaHS reports from 03/13.

cae · March 18, 2018

Some quick housekeeping: I renamed this thread "Winter model performance discussion" since that's a more accurate reflection of what it's intended to be. I also moved some of the posts to the second page of the thread because the first page was getting graphics-heavy. My browser was having a hard time keeping up. Now on to the last storm...

cae · March 18, 2018

Before I discuss how the models did in the last storm, I'd like to put up a brief discussion of model averaging, and why I think it works. Although the example I give will be overly simplistic with a lot of assumptions, I hope it gets some of the main ideas across.

Let's imagine a situation in which we're interested in the track of a storm. For simplicity, say we're only concerned with whether it's going north or south. The Euro comes in and shows us where it thinks the storm is going.

So what does that tell us? The Euro is just an estimate of the true track of the storm, and there will be some error in its forecast. The storm might go more north than the Euro thinks, or it might go more south. The true track is more likely to be close to the Euro's track than it is to be really far away. If we plotted the historical distribution of actual storm tracks relative to the Euro forecast tracks, we'd get some probability distribution around the Euro's forecast track. It would look something like this.

This tells us the relative probabilities of the actual storm tracks given the Euro's forecast. (In the above example I've assumed the distribution is a Normal distribution, a.k.a. a "bell curve". There are some good reasons to use this distribution, but it's an assumption I'll revisit below.)

OK, now imagine that the GFS comes in with a forecast that is well north of the Euro. There will also be a probability distribution of possible storm tracks given the GFS forecast.

Notice how I've made the distribution wider for GFS tracks than for Euro tracks. That's because the GFS has less skill than the Euro, so the average error in forecast tracks will be larger for the GFS.

So now what is the probability distribution for storm tracks considering both the GFS forecast and the Euro forecast? Assuming the errors in the forecasts are independent (i.e. they're not correlated), we can just multiply the two distributions above, then re-scale the resulting distribution to make sure all probabilities add up to 1.
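As a rough sketch of that multiply-and-rescale step, here is one way it could be done numerically. The track positions and widths are made-up numbers for illustration (miles north of an arbitrary reference line), not values taken from either model, and `normal_pdf` is just a helper defined here.

```python
import numpy as np

def normal_pdf(x, mean, width):
    """Normal (bell curve) density; width is the standard deviation."""
    return np.exp(-0.5 * ((x - mean) / width) ** 2) / (width * np.sqrt(2.0 * np.pi))

# Hypothetical numbers: track position in miles north of an arbitrary reference line.
x = np.linspace(-600.0, 600.0, 12001)   # 0.1-mile grid
dx = x[1] - x[0]

euro = normal_pdf(x, mean=0.0, width=50.0)     # narrower: more skill assumed
gfs = normal_pdf(x, mean=100.0, width=100.0)   # wider: less skill assumed

combined = euro * gfs               # multiply the two distributions point by point
combined /= combined.sum() * dx     # re-scale so the probabilities add up to 1

peak = x[np.argmax(combined)]
print(f"most likely track: {peak:.1f} miles north of the reference line")
```

With these assumed inputs the most likely track lands about 20 miles north of the Euro's track, much closer to the Euro than to the GFS, because the Euro was given the narrower distribution.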

If we assume our initial probability distributions were normal, then this resulting probability distribution (the grey curve above) is also normal. The grey curve has a couple of interesting properties:

1. The location of the peak of the grey curve is a weighted average of the locations of the peaks of the blue and orange curves. The weights are given by the inverse of the squares of the widths of the blue and orange curves. So for example, if the orange curve is twice as wide as the blue curve, then it should be given 1/4 the weight of the blue curve when calculating the average.

2. The grey curve is narrower than either the blue curve or the orange curve. We actually have more confidence in the average forecast track than we do for any single op run.
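The two properties above can be written as a short closed-form calculation. This is a minimal sketch assuming both distributions are normal and the errors are independent; the function name `combine` and the example numbers are made up for illustration.

```python
import math

def combine(mean_a, width_a, mean_b, width_b):
    """Combine two independent normal forecasts (width = standard deviation)."""
    weight_a = 1.0 / width_a ** 2    # weight is the inverse of the squared width
    weight_b = 1.0 / width_b ** 2
    mean = (weight_a * mean_a + weight_b * mean_b) / (weight_a + weight_b)
    width = math.sqrt(1.0 / (weight_a + weight_b))
    return mean, width

# Hypothetical example: the second curve is twice as wide as the first,
# so it gets 1/4 of the first curve's weight.
mean, width = combine(0.0, 50.0, 100.0, 100.0)
print(f"combined peak: {mean:.1f}, combined width: {width:.1f}")
```

In this example the combined width comes out smaller than either input width, which is property 2.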

Now let's say that we get closer to game time, and forecast tracks of the Euro and GFS are just as far apart. But since we're at shorter lead times (say a 72-hour forecast instead of a 120-hour forecast), the expected errors in both the Euro and GFS forecasts are smaller. The spread in likely storm tracks given these forecasts will be smaller as well.

Now the probability of either the Euro or GFS being "right" gets smaller, and we're even more likely to end up with something in the middle.
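One way to put numbers on this is to compare how likely each model's exact track is relative to the most likely (weighted-average) track, at a long and a short lead time. This is only a sketch: reading "right" as "likelihood relative to the peak" is an interpretation, and the tracks, widths, and helper names are hypothetical.

```python
import math

def combine(mean_a, width_a, mean_b, width_b):
    """Combine two independent normal forecasts (width = standard deviation)."""
    weight_a = 1.0 / width_a ** 2
    weight_b = 1.0 / width_b ** 2
    mean = (weight_a * mean_a + weight_b * mean_b) / (weight_a + weight_b)
    return mean, math.sqrt(1.0 / (weight_a + weight_b))

def relative_likelihood(track, mean, width):
    """Density at `track` divided by the density at the peak of the combined curve."""
    return math.exp(-0.5 * ((track - mean) / width) ** 2)

# Hypothetical setup: same forecast tracks at both lead times, smaller widths closer in.
euro_track, gfs_track = 0.0, 100.0
leads = {"120-hour": (100.0, 200.0), "72-hour": (50.0, 100.0)}  # (Euro width, GFS width)

results = {}
for label, (euro_width, gfs_width) in leads.items():
    mean, width = combine(euro_track, euro_width, gfs_track, gfs_width)
    results[label] = (
        relative_likelihood(euro_track, mean, width),
        relative_likelihood(gfs_track, mean, width),
    )
    print(f"{label}: Euro {results[label][0]:.2f}, GFS {results[label][1]:.2f}")
```

Under these assumptions both models' exact tracks become less likely relative to the in-between track as the widths shrink, and the drop is much larger for the less-skilled model.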

There are a few important assumptions to consider in this analysis:

1. The assumption that the forecast errors are normally distributed. There are some statistical arguments to support this assumption, and based on some quick internet searches normal distributions appear to be widely used historically. However I didn't find much empirical evidence supporting this assumption. It is almost certainly more valid at short lead times than long ones, and for some forecast variables more than others. If the distribution is not normal, that could change some details of the above analysis, but we would still expect "average" tracks in between the two forecasts to be the most likely.

2. I have assumed that there is no bias in the model output. For example, if one model consistently forecasts storm tracks that are on average too far south, the model output should be adjusted north before including it in any average.

3. I have assumed the errors are not correlated. This is actually a pretty big assumption, as model errors almost certainly are correlated. It is best to use models that are not highly correlated. When model errors are correlated, you should further decrease the weight you give to the less-skilled model. Examples of models that (I believe) are highly correlated include any op and its ensemble control run, the GGEM and RGEM, and any model and the para-version (i.e. the next incremental improvement) of the same model. Models that use different physics will probably have less correlation and be more useful. I suspect the ICON, which is non-hydrostatic, probably has relatively low correlation with the rest of the globals, which makes it more useful in model averaging than it otherwise would have been.

4. I have assumed that we can simplify the model output to a single dimension, e.g. north vs. south. Reality is more complicated. That brings us to the 03/12 system, which I'll discuss in the next post.
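Assumptions 2 and 3 can be relaxed in a simple two-model sketch. The code below subtracts an assumed known bias from each forecast before averaging, and uses the standard minimum-variance weights for two estimates whose errors have correlation `rho`. All numbers, the value of `rho`, and the function names are hypothetical; real model biases and correlations would have to be estimated from verification data.

```python
def correlated_weights(width_a, width_b, rho):
    """Minimum-variance weights for two unbiased forecasts with correlated errors.

    width_a, width_b: error standard deviations; rho: error correlation (-1 < rho < 1).
    With rho = 0 this reduces to inverse-squared-width weighting.
    """
    cov = rho * width_a * width_b
    denom = width_a ** 2 + width_b ** 2 - 2.0 * cov
    weight_a = (width_b ** 2 - cov) / denom
    return weight_a, 1.0 - weight_a

def blend(track_a, track_b, width_a, width_b, rho=0.0, bias_a=0.0, bias_b=0.0):
    """Weighted average of two forecast tracks after removing each model's bias.

    bias is the assumed average of (forecast - truth); it is subtracted first.
    """
    weight_a, weight_b = correlated_weights(width_a, width_b, rho)
    return weight_a * (track_a - bias_a) + weight_b * (track_b - bias_b)

# Hypothetical example: model B is twice as wide (less skilled) than model A.
print(correlated_weights(50.0, 100.0, rho=0.0))   # independent errors
print(correlated_weights(50.0, 100.0, rho=0.4))   # correlated errors
print(blend(0.0, 100.0, 50.0, 100.0, rho=0.4, bias_b=50.0))
```

With positive correlation the weight on the wider (less-skilled) model drops below its independent-errors value, and for strong enough correlation it can reach zero or go negative, which is one way to see why a highly correlated second model adds little to the average.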

cae · March 18, 2018

The 03/12/18 storm was interesting because the Euro and GFS were far apart, even fairly late in the game. The GFS showed more amped solutions that gave the areas around DC and Baltimore a good amount of snow, and the Euro was consistently more suppressed. The below frames show the model outputs at 03/09/18, about 72 hours before snow started falling in the region.

So what to do when the Euro and GFS are so far apart so close to game time? The above analysis suggests that it's highly likely that they'll meet in the middle. The problem is that "the middle" is not intuitive when you're talking about a real world storm. It's not as simple as "more north" or "more south". That's where the GGEM and ICON were useful. They showed what the "middle" could look like. And they were both closer to the truth than the Euro or the GFS. The GGEM did particularly well on this one, but even the ICON seemed to outperform the Euro and GFS at this range.

The GGEM and ICON weren't without their flaws though. They both lost the storm (suppression) on their 03/10 00z runs (correlated errors?). And the ICON took a while to see the turn up the coast. It seems to do better with the first stage of storms than the subsequent coastal development.

I'll also mention the ensembles in this storm. Given the unusually large spread in the op runs, ideally the ensembles should have reflected some of the uncertainty. But the EPS and GEFS tended to follow their ops. The GEPS were a little more consistent. Whether it's because they have more spread or because the GGEM was more consistent, I'm not sure. But they also gave a fairly large "false positive" signal, with at one point 85% of members giving me more than 1" of snow and a snow mean of about 6". They appeared to be too amped on average.

cae · March 22, 2018