Evaluation is not a matter of computing a single metric, or even a fixed set of metrics, of forecast skill; what matters varies between end-users who rely on forecast information to make different decisions. To illustrate this, let's break evaluation down by a few key streamflow forecast characteristics:
- Horizon - as forecasts move further into the future, their skill decreases, so it is helpful to define forecast skill at a range of forecast horizons. How well we forecast river flows tomorrow, next week, or next month are separate questions, each requiring its own evaluation scores.
- Magnitude - hydrographs are continuous processes measured at regular intervals, with magnitudes that vary over time. Forecasting low-flow baseflow conditions is different from forecasting peak flood flows, and yet the skill for each is commonly aggregated into a single metric (e.g. Nash-Sutcliffe or Kling-Gupta efficiency). It may be beneficial to break up forecast skill by the type of flow condition being measured.
- Season - forecast skill will vary through the year as the runoff-generating processes change. For example, the skill in forecasting multi-week flows in snow-dominated regions during the spring freshet will be higher than for the same period in a regime where runoff is storm-dominated. Breaking out horizons and magnitudes by season will deepen your understanding of forecast skill (a minimal sketch of such stratified scoring follows this list).
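To make stratified evaluation concrete, here is a minimal sketch in Python. It assumes a hypothetical archive of matched forecast-observation pairs (the file name and the columns `valid_time`, `lead_days`, `obs`, and `sim` are illustrative, not from any particular system) and scores Nash-Sutcliffe and Kling-Gupta efficiency by lead time, flow tercile, and season:

```python
import numpy as np
import pandas as pd

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 is no better than the mean of obs."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency: combines correlation, bias, and variability ratios."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    beta = sim.mean() / obs.mean()    # bias ratio
    alpha = sim.std() / obs.std()     # variability ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (alpha - 1) ** 2)

# Hypothetical archive of matched forecast-observation pairs, one row per
# (valid_time, lead_days) combination.
pairs = pd.read_csv("forecast_pairs.csv", parse_dates=["valid_time"])

# Stratify three ways: horizon (lead time), magnitude (flow tercile), season.
pairs["flow_bin"] = pd.qcut(pairs["obs"], 3, labels=["low", "mid", "high"])
season_of = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
             6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}
pairs["season"] = pairs["valid_time"].dt.month.map(season_of)

skill = (pairs.groupby(["lead_days", "flow_bin", "season"], observed=True)
              .apply(lambda g: pd.Series({"NSE": nse(g["obs"], g["sim"]),
                                          "KGE": kge(g["obs"], g["sim"])})))
print(skill)
```

Scores computed this way can reveal, for instance, that a model with a respectable overall KGE still performs poorly on high flows at long lead times - exactly the kind of detail a single aggregated metric hides.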
Which measure of forecast skill matters ultimately depends on the decisions the forecasts inform. For example, emergency managers might be interested in peak flows a few hours ahead, whereas dam operators selling power into day-ahead markets may care only about 24-hour forecast volumes. Other dam operators may need to manage reservoir storage for flood control or water supply and look several days, weeks, or even months ahead. The first and third cases require forecast skill on flood peaks, whereas energy marketers may be more focused on common day-to-day flow conditions. Offered two sets of forecasts, one operator may forgo skill in a certain horizon and magnitude range to gain skill elsewhere.
Unlocking the benefits for practical application
Evaluation of streamflow forecasts is not commonplace throughout industry - why? It is not for lack of academic literature, which is flush with resources to help forecasters, managers, and operators evaluate forecast skill with hundreds of different metrics. It is more likely due to a lack of access to the infrastructure required to support a robust evaluation process. First, evaluation requires specialized databases that store every forecast generated alongside the observed conditions. For deterministic (single-trace) forecasts this is challenging enough; for ensemble (multi-trace) forecasts it is even more daunting. Second, systems are needed to access, process, extract, and visualize these forecasts. With so many organizations using different databases, data structures, forecast horizons, and frequencies, managing the required data volumes can be difficult. Furthermore, real-time systems can produce and accumulate very large amounts of forecast performance metrics - who has the time to look at all that information? Careful thought must be given to visualizing the evaluation results efficiently so that industry users can interpret them for practical applications.
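To give a sense of what that storage layer involves, here is a minimal sketch of a forecast archive using SQLite; the table and column names are illustrative assumptions, not a standard:

```python
import sqlite3

con = sqlite3.connect("forecast_archive.db")
con.executescript("""
-- Every forecast issued: one row per site, issue time, valid time,
-- and ensemble member (use member = 0 for deterministic forecasts).
CREATE TABLE IF NOT EXISTS forecasts (
    site_id    TEXT    NOT NULL,
    issue_time TEXT    NOT NULL,  -- when the forecast was produced
    valid_time TEXT    NOT NULL,  -- the time the forecast applies to
    member     INTEGER NOT NULL,
    flow       REAL,
    PRIMARY KEY (site_id, issue_time, valid_time, member)
);

-- Observed flows, joined to forecasts on (site_id, valid_time) at
-- evaluation time; lead time is valid_time minus issue_time.
CREATE TABLE IF NOT EXISTS observations (
    site_id    TEXT NOT NULL,
    valid_time TEXT NOT NULL,
    flow       REAL,
    PRIMARY KEY (site_id, valid_time)
);
""")
con.commit()
```

A key design choice here is keying forecasts by both issue time and valid time: their difference is the lead time, which makes horizon-stratified evaluation like the sketch above a simple join and group-by. Note how an ensemble multiplies the row count by the member dimension - the storage burden the paragraph above describes.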
With a robust evaluation system at hand, decision-makers can calibrate their expectations of forecast performance and adjust how they use the information in decisions; e.g. knowing whether the flow will be 100 ± 3 or 100 ± 30 can make a big difference to the decision at hand. Likewise, for users and organizations that generate their own forecasts or use publicly available data, evaluation can indicate where and how to improve the forecasts where it matters most. Thus, operating a robust evaluation system alongside a forecast system can deliver a wide range of benefits.
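To make that ± example concrete: with an ensemble forecast, the spread translates directly into the probability of crossing a decision threshold. A minimal sketch with made-up numbers (the threshold and ensemble parameters are purely illustrative):

```python
import numpy as np

# Hypothetical action threshold and two synthetic 50-member ensembles,
# both centred near 100 but with very different spreads (the "100 +/- 3"
# versus "100 +/- 30" cases from the text).
threshold = 110.0
rng = np.random.default_rng(42)
tight = rng.normal(loc=100, scale=3, size=50)
wide = rng.normal(loc=100, scale=30, size=50)

for name, ens in [("tight (+/- 3)", tight), ("wide (+/- 30)", wide)]:
    p_exceed = np.mean(ens > threshold)  # fraction of members above threshold
    print(f"{name}: P(flow > {threshold:g}) ~ {p_exceed:.2f}")
```

The tight ensemble implies essentially no chance of exceeding the threshold, while the wide one implies real odds - two very different decisions from the same central estimate. A well-evaluated forecast tells the user which of those two pictures to trust.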
So, if you ever hear someone claim their forecast is 93% accurate, dig a little deeper to understand what their forecasts actually show and how that skill affects their decision-making process. Their crystal ball may be telling an incomplete tale.
Learn more about our forecasting and monitoring services to see how we can support your organization.