Practical Forecast Performance Evaluation – Part 4: Representativeness
In assessing flow forecast performance it is important to look at times when the forecasts are important factors in decision-making and evaluate how well the forecasts predicted what eventually happened. In Part 1, Part 2, and Part 3 of our series, we introduced the concepts of forecast events, flow events, and alignment. In this final installment of the series, we will outline a way to make the forecast performance evaluation as representative as possible. This is especially important during times of increasing climate variability, when the current hydrology might be very different from that of a few decades ago and we cannot assume that forecasts that were good in the past will be a good indicator of future performance.
We know from statistics 101 that stable statistics require large sample sizes and that statistics based on very few samples are typically not representative of the entire dataset. Ideally, we have assembled an archive of forecasts or hindcasts, which are the forecasts that would have been created in the past if we had had a system to create them. With a long enough archive of forecasts or hindcasts at hand, we can calculate meaningful and stable ‘goodness of fit’ statistics or manually assign alignment scores. In practice, however, we rarely have a large forecast archive and may only have a few events of interest to assess. So how can we gain at least some insight into forecast performance given so little information to work with?
Well, as is often the case in water resources, we make the best of the little data we have and aim to collect more to study later. Once we have these new data, we update our assessment of how well our forecasts are performing. In statistics this is often called sequential analysis. In its simplest form, we can simply update the average of all our alignment scores any time we compute a new one. More complex methods that account for the uncertainty in any one score (“is it really 0.7? Or maybe somewhere between 0.5 and 0.9?”) can also be applied.
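The simple sequential update described above can be sketched in a few lines of code. The snippet below uses a running (incremental) mean so that no history of past scores needs to be stored, and a Welford-style running variance as one rough way to track the spread of the scores. The alignment scores used here are made-up values purely for illustration; in practice they would come from comparing forecast events against observed flow events.

```python
# Sequential update of alignment-score statistics.
# Running mean: mean_{n} = mean_{n-1} + (x_n - mean_{n-1}) / n
# Welford's method also tracks the sum of squared deviations (m2),
# from which the sample variance is m2 / (n - 1).

def update(count, mean, m2, new_score):
    """Fold one new alignment score into the running statistics."""
    count += 1
    delta = new_score - mean
    mean += delta / count
    m2 += delta * (new_score - mean)
    return count, mean, m2

# Hypothetical alignment scores arriving one at a time.
count, mean, m2 = 0, 0.0, 0.0
for score in [0.6, 0.8, 0.7]:
    count, mean, m2 = update(count, mean, m2, score)

variance = m2 / (count - 1) if count > 1 else float("nan")
print(f"mean={mean:.2f}, variance={variance:.4f}")
```

Each new score refines the estimate without reprocessing the whole archive, which is exactly the "assess as we go" approach: the mean gives the rough performance indication early on, and the variance gives a first sense of how much any single score should be trusted.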
So rather than waiting a long time to assemble a sufficiently large set of forecast events, flow events, and alignment scores (or statistics) before assessing forecast performance, we can do it as we go. We will have rough indications of forecast performance via a few alignment scores or statistics early on, and we can refine them as time goes on, trusting the forecasts more and more if the aggregate alignment score inches up.
Perfect? No. But practical.