1 PROJECT OBJECTIVES AND ACCOMPLISHMENTS
The overall goal of this project was to develop strategies for verifying mesoscale information received from current and future generations of operational numerical weather prediction (NWP) models. New strategies are needed because there are fundamental problems related to recent trends in mesoscale NWP. Operational forecast models are consistently moving to higher and higher resolution, yet current verification metrics are appropriate for large-scale, smoothly varying forecasts. When models produce small-scale, high-amplitude features (SHAF), similar in character to real features that threaten life and property, they are typically penalized by currently used verification metrics (Baldwin et al. 2001). Since verification scores directly influence trends in model development, newer generations of models tend to favor smoothly varying solutions rather than SHAF, in spite of the fact that predictions of SHAF might be very helpful to forecasters. In short, current verification metrics are inconsistent with the notion of measuring the value of mesoscale model forecasts to human forecasters.
In this project, we have taken three separate but related steps to ameliorate this inconsistency. First, we have investigated new ways of applying traditional verification measures (i.e., equitable-threat and bias scores) to evaluate their utility on smaller time and space scales. Second, we have actively solicited forecaster input from the SPC and elsewhere in order to identify those elements of model forecasts that make them valuable to human forecasters. Third, we have designed new verification strategies that key on identifiable features and reflect the value assessments of forecasters.
1.1 New applications of ETS and bias scores
Early in this study, automated procedures were established for precipitation data collection and statistical calculations as well as web-based graphical displays of equitable-threat (ET) and bias scores. Initially, we chose to concentrate on NCEP's operational Eta model and an experimental version of this model (Etakf) that is running twice daily at the National Severe Storms Laboratory (NSSL). In April 2000, we incorporated scores from a daily run of MM5 that was configured very similarly to the Etakf, though this run was terminated during the spring of 2001. These scores can be viewed on-line at http://vicksburg.nssl.noaa.gov/verf. In May 2001, we began including output from NCAR's experimental run of the Weather Research and Forecasting (WRF) model on a separate web site: http://www.nssl.noaa.gov/etakf/verf. In addition, twenty-four hour precipitation totals from these runs are plotted along with other operational and experimental versions of NCEP models at http://sgi62.wwb.noaa.gov:8080/verf/pcpgifs.html.
These web sites and data archives have proven to be valuable resources for sharing information with model developers from NCEP's Environmental Modeling Center (EMC), NCAR, and elsewhere. They provide ET and bias scores for commonly used time and space scales (24 h time periods and the 48 contiguous states) but we also provide these scores using higher resolution in both time and space. For example, results are presented in 3 h and 12 h segments and separately over numerous regions of the country. In addition, time series of scores are plotted for various precipitation threshold values.
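For reference, both scores derive from a 2x2 contingency table of forecast and observed event occurrences at a given precipitation threshold. A minimal Python sketch of the standard formulas follows (the function names and array-based counting are our illustration, not the code behind these web sites):

    import numpy as np

    def contingency_table(forecast, observed, threshold):
        # Count hits, false alarms, and misses on gridded fields
        # exceeding a precipitation threshold.
        f = forecast >= threshold
        o = observed >= threshold
        hits = int(np.sum(f & o))
        false_alarms = int(np.sum(f & ~o))
        misses = int(np.sum(~f & o))
        return hits, false_alarms, misses, forecast.size

    def bias_score(hits, false_alarms, misses):
        # Frequency bias: number of forecast events over observed events.
        return (hits + false_alarms) / (hits + misses)

    def equitable_threat_score(hits, false_alarms, misses, total):
        # ETS: threat score corrected for hits expected by random chance.
        hits_random = (hits + misses) * (hits + false_alarms) / total
        return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

With these definitions, a bias above one indicates over-forecasting of event frequency at that threshold, and an ETS of zero indicates no skill beyond random chance.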
These unique ways of presenting the data have allowed us to show, for example, that the Etakf run has a much higher-amplitude diurnal cycle than the operational ETA, and that the coastal precipitation bias in older versions of the ETA model has become much less prominent since the model was modified to address this problem. They have also revealed that the ETA and Etakf runs often differ substantially over short time periods and/or various regions, but over longer time and space scales they are very similar. In general, ET scores from the ETA tend to be higher than the Etakf for precipitation rates of about 0.5 in./day or less, while the Etakf tends to score higher for larger rates, especially during the colder months of the year. Both configurations of the ETA show a high bias for the lower precipitation thresholds. The Etakf typically stays close to a bias of one at higher thresholds, while the ETA drops off sharply for higher rates. Preliminary results from NCAR's version of the WRF model are more like the Etakf than the operational ETA.
The automated procedures that collect and manipulate data for this precipitation verification and display will continue to operate indefinitely as a semi-permanent legacy of this project. Even though ET and bias scores clearly reward smooth, larger-scale representations of forecast fields, we feel it is important to maintain this data set and to track these scores with newer generations of models. While future model development should be guided primarily by more sophisticated verification metrics, it is clearly desirable that ET and bias scores continue to improve with new models. The WRF model development team is currently monitoring real-time WRF performance using the web sites developed during this project.
1.2 Subjective assessments of model forecast value
Interactions between research scientists and forecasters occur on a daily basis at the SPC, through casual contact and as part of scheduled daily map discussions. These interactions often provide valuable feedback to model developers regarding the utility of various model solutions, output displays, etc. Likewise, forecasters benefit from the developers' insights into model behavior. Each spring during this project, our subjective impressions of model performance have been documented during model verification programs that have involved all of the PIs on this project. These programs have been carried out with active participation from the SPC, NSSL, EMC, the Forecast Systems Laboratory (FSL), the Norman Weather Forecasting Office, and Iowa State University.
In the spring of 2000 we formally evaluated the guidance provided by the three different models - the ETA, Etakf, and RUC II - in predicting convective initiation and evolution. This evaluation was done primarily in a "short answer" format. Each day participants were asked to select a specific area of interest, typically an area of slight or moderate risk for severe weather, based on the SPC's 13Z convective outlook. Then they were asked to respond to the following instruction: "Describe the scenario for convective development and evolution between 18Z and 09Z for each of the models below. Be sure to include timing and location of initiation, evolution, convective mode, and forecast intensity." Similarly, after examining the actual evolution of events the following morning, participants were asked to provide a subjective verification of the event by responding to: "Discuss the strengths and weaknesses of each model forecast of convective initiation and evolution compared to verifying data (include comments on timing, location, evolution, intensity, etc.). Was the guidance useful from an operational perspective?"
The process of verbalizing our opinions in this manner was very illuminating to the participants and prompted a number of fascinating discussions.
More pointed questions queried participants about which model(s) provided the best overall guidance for the location and timing of convective initiation on a given day, and whether 12Z initializations showed an improvement over 00Z runs. Over the course of the experiment, participants judged that the 12Z Etakf run provided the best prediction of the initial location and timing of convection more often than any other model (Fig. 1). When 00Z runs of the ETA and Etakf were compared to 12Z runs of the same model, it was found that the 00Z run provided better guidance 38% of the time for both configurations of the ETA. This was a surprising result, as one might expect that model forecasts would consistently improve as the model initialization time gets closer to the verifying time.
Fig. 1. Subjective impressions from the NSSL/SPC Spring Program 2000: Best overall prediction of convective initiation and evolution, comparison of forecasts from 0000 UTC and 1200 UTC initializations.
In reviewing the 2000 spring program, it became apparent that many of the most interesting results from this project were very difficult to extract and to quantify concisely because much of the relevant information was embedded within discussions. Building on the lessons learned in 2000, we designed a similar program for 2001 (see http://www.spc.noaa.gov/exper/Spring_2001), but with two major refinements. First, the systematic evaluation of model output was incorporated within a forecast-preparation process, similar to an operational forecasting environment. Second, participants conveyed their impressions about model data through survey-type rating scales rather than through narrative answers (although complementary discussion was encouraged). Specifically, participants were asked to evaluate ten different forecast fields for each model as part of the forecast-preparation process.
These fields were selected ahead of time by SPC forecasters to represent commonly used measures of instability, dynamic forcing, and moisture, as well as model predictions of precipitation. Each of these fields was rated (on a scale from 1 to 10) in two different categories. The first was an assessment of how favorable that field was for severe convection in the given model's solution. The second was a measure of forecaster confidence in that model's forecast of that field. In addition, each of the ten models (the 00Z and 12Z ETA and Etakf; the 12Z and 15Z RUC II; the 12Z and 15Z experimental RUC with 20 km grid length; the 12Z regional ETA with 10 km grid length; and an experimental version of the WRF model run at NSSL) was rated for overall utility in the forecast-preparation process, and forecasters were asked to identify which fields were particularly critical for each forecast. The following day, each field that was evaluated was subjectively verified against observations.
Fig. 2. Results from comparisons of Eta and Etakf forecasts during the NSSL/SPC Spring Program 2001: a) subjective impressions of QPF forecasts and b) equitable threat scores for the same spatial and temporal domain.
With this approach, we compiled a comprehensive data set that has the potential to reveal a wealth of information about how forecasters use model data and what elements of model forecasts are particularly valuable to them. Preliminary results are consistent with the previous year's findings. For example, focusing on the comparison between the ETA and Etakf runs (for forecasts in which all of these runs were verified), the 12Z run of the Etakf was ranked the highest for quantitative precipitation forecast (QPF) 34% of the time, more than any other configuration of the ETA. The 00Z Etakf was next (a change from last year), followed by the 00Z and 12Z operational ETA runs, respectively (Fig. 2a). These comparisons have been done in several other ways, including various statistics based on the mean values of the individual rankings as well as statistical evaluation of the daily rankings, with qualitatively similar results.
As we had anticipated, these subjective verification statistics present a picture that is quite different from ET scores. Specifically, ET scores corresponding directly in time and space to these subjectively derived comparisons clearly favor the operational ETA model (Fig. 2b). This contradiction substantiates our argument that current verification metrics do not measure the value of model predictions to human forecasters.
1.3 Development of new verification strategies
1.3.1 When are traditional verification metrics deficient?
An unanswered question at this stage is: what characteristics of certain forecasts make them valuable to humans? Our perception of this is best illustrated by an idealized example. Consider simulated precipitation fields generated using an elliptical shape function, following Williamson (1981). In the "observed" field (Fig. 3), a relatively large ellipse is found, with several smaller-scale, higher-amplitude ellipses embedded within it. The domain consists of 128 x 128 grid points. For the sake of providing dimensions to the problem, if we assume the grid spacing is 5 km, the large-scale ellipse is approximately 1000 km long and 300 km wide, while the smaller-scale ellipses are approximately 100 km long and 50 km wide.
Fig. 3: Simulated precipitation fields; observed (left), forecast #1 (center), forecast #2 (right).
Simulated forecast #1 consists of a single large-scale ellipse, whose center is displaced compared to the observed larger-scale ellipse but with similar amplitude. The orientation of the ellipse is also in error, and the forecast ellipse is wider than the observed. Simulated forecast #2 contains features that are shaped similarly (both larger and smaller-scale) to the observed field, but the entire area is displaced to the "southeast" compared to the observed field and the amplitude of the larger-scale ellipse is slightly less than the observed. In addition, the randomly configured smaller-scale ellipses are positioned differently relative to the center of mass of the larger-scale feature.
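For concreteness, fields of this type can be generated in a few lines of code. The sketch below is our own construction: a quadratic elliptical shape function that tapers to zero at the feature edge, with illustrative parameter values (not those of Williamson 1981), on the 128 x 128 grid described above:

    import numpy as np

    def ellipse_field(nx, ny, x0, y0, a, b, theta, amp):
        # Elliptical shape function: amplitude amp at center (x0, y0),
        # semi-axes a and b (grid lengths), orientation theta, tapering
        # to zero at the edge of the ellipse.
        y, x = np.mgrid[0:ny, 0:nx]
        xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
        yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
        return amp * np.clip(1.0 - (xr / a) ** 2 - (yr / b) ** 2, 0.0, None)

    nx = ny = 128
    rng = np.random.default_rng(0)

    def add_small_ellipses(base, n=4):
        # Overlay n randomly placed/oriented small, high-amplitude (SHAF)
        # ellipses within the broad feature.
        for _ in range(n):
            cx = 64 + int(rng.integers(-30, 31))
            cy = 64 + int(rng.integers(-10, 11))
            base = np.maximum(base, ellipse_field(nx, ny, cx, cy, 10, 5,
                                                  rng.uniform(0, np.pi), 1.0))
        return base

    # "Observed": broad ellipse with embedded SHAF.
    observed = add_small_ellipses(ellipse_field(nx, ny, 64, 64, 50, 15,
                                                np.deg2rad(30), 0.6))

    # Forecast #1: single smooth ellipse; displaced, wider, mis-oriented.
    forecast1 = ellipse_field(nx, ny, 72, 56, 50, 22, np.deg2rad(60), 0.6)

    # Forecast #2: similar structure to observed (SHAF repositioned), but
    # slightly weaker and displaced toward the "southeast".
    forecast2 = np.roll(add_small_ellipses(ellipse_field(nx, ny, 64, 64, 50, 15,
                                                         np.deg2rad(30), 0.55)),
                        (12, 12), axis=(0, 1))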
Visual inspection of these forecasts suggests that forecast #2 is more realistic than forecast #1. In particular, forecast #2 predicts a reasonable distribution and pattern of SHAF, while SHAF are absent in forecast #1. Since weather-related damage to life and property is often associated with SHAF, forecasters would likely find more value in forecast #2. Yet, Table 1 shows that popular verification metrics favor the smoother, lower-amplitude forecast. In particular, forecast #1 produces lower mean absolute and root-mean-square errors (lower values preferred) and a higher equitable threat score (higher preferred); bias scores (ratio of average forecast to average observation) are equal.
Table 1. Verification scores for the simulated forecasts shown in Fig. 3.

Verification measure                        Forecast #1   Forecast #2
Mean absolute error                         0.157         0.159
RMS error                                   0.254         0.309
Bias                                        0.98          0.98
Equitable threat score (0.45 threshold)     0.170         0.102
Brooks and Doswell (1996) categorize these verification metrics as "measures-oriented" approaches to verification. An alternate and more complete approach involves the analysis of the joint distribution of forecasts and observations (Murphy and Winkler 1987), dubbed by Brooks and Doswell (1996) the "distributions-oriented" approach. An important element of this approach is the association, defined as the overall strength of the linear relationship between forecasts and observations. Scatter plots (Fig. 4) show that this relationship is not particularly strong for either forecast; forecast #1 has a correlation coefficient of 0.486 while forecast #2 has a correlation of 0.429. So a brief examination of the distributions-oriented approach suggests, once again, that a forecast system containing realistic SHAF is of lesser quality than a smooth, low-amplitude forecast.
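The measures in Table 1 and the association just described are straightforward to compute on gridded fields. A minimal sketch (the bias here is the ratio of field averages, as defined above):

    import numpy as np

    def summary_scores(forecast, observed):
        # Measures-oriented scores plus the linear association
        # (correlation) between forecast and observed grid values.
        err = forecast - observed
        return {
            "mae": float(np.mean(np.abs(err))),
            "rmse": float(np.sqrt(np.mean(err ** 2))),
            "bias": float(forecast.mean() / observed.mean()),
            "association": float(np.corrcoef(forecast.ravel(),
                                             observed.ravel())[0, 1]),
        }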
1.3.2 Other approaches to solving the problem
Clearly, these traditional approaches to forecast verification fail to reflect important elements of human perception of forecast quality. At this stage, we do not have a verification metric that will solve this problem. However, we do have a strategy for developing such a metric (or metrics). In order to place our strategy in proper perspective, we provide a brief review of related research.
Anthes (1983) recognized many of the problems with traditional verification techniques in the early days of mesoscale modeling. He emphasized the need for metrics that reflect the "realism" of a forecast. One technique suggested by Anthes involves examining the characteristics of significant phenomena, such as the central pressure of cyclones or the maximum wind speeds of thunderstorms. A "phenomena-based" approach proposed by Williamson (1981) uses pattern recognition to objectively identify geopotential height systems on a constant-pressure surface. An empirical function representing a high or low center, elliptically shaped with amplitude, position, and shape parameters, is fit to the field. The parameters defining the function are determined by minimization, and good first guesses are required. A major issue with this technique is its reliance on an empirical function to fit shapes in the forecast and observed fields. This has the advantage of explicitly defining the attributes of the phenomena of interest, but the disadvantage of trying to fit possibly complex natural patterns with a simple empirical shape.
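The fitting step in a Williamson-style approach can be sketched with a generic least-squares minimizer. The quadratic elliptical form, the cost function, and the choice of the Nelder-Mead method below are our illustrative assumptions, not the exact formulation of Williamson (1981):

    import numpy as np
    from scipy.optimize import minimize

    def fit_ellipse(field, first_guess):
        # Fit amplitude, position, and shape parameters of an elliptical
        # shape function to a 2D field by least squares. As Williamson
        # (1981) found, a good first guess is required.
        ny, nx = field.shape
        y, x = np.mgrid[0:ny, 0:nx]

        def shape(p):
            amp, x0, y0, a, b, theta = p
            xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
            yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
            return amp * np.clip(1.0 - (xr / a) ** 2 - (yr / b) ** 2,
                                 0.0, None)

        return minimize(lambda p: np.sum((shape(p) - field) ** 2),
                        first_guess, method="Nelder-Mead")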
Ebert and McBride (2000) also present methods to verify characteristics of phenomena (contiguous rainfall areas). When there is some overlap between observed and forecast precipitation areas, Ebert and McBride (2000) decompose the forecast error into components due to displacement, amplitude, and "shape" errors. While this information on the differences between forecast and observed spatial structure is certainly useful, it says nothing about the types of meteorological phenomena associated with the forecast and observed areas of rainfall, or whether those phenomena were in agreement. For example, consider the idealized observations and forecasts in Fig. 3. Both the forecast #2 and observed rainfall fields contained the same scales of elliptically shaped rainfall, but they are distributed differently in space. In this case, the Ebert-McBride technique would diagnose large "shape" errors, even though the size and shape of the predicted smaller-scale rainfall features were nearly identical to those observed.
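A greatly simplified sketch of this style of decomposition follows. Unlike the actual Ebert-McBride technique, which operates on matched contiguous rain areas, this version simply shifts the entire forecast field to its best-matching position and then splits the mean-square error into displacement, volume, and pattern terms:

    import numpy as np

    def decompose_error(forecast, observed, max_shift=20):
        # Find the whole-field shift that minimizes MSE, then decompose:
        # total MSE = displacement + volume (mean offset) + pattern terms.
        best_mse, best_shift = np.inf, (0, 0)
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                mse = np.mean((np.roll(forecast, (dy, dx), axis=(0, 1))
                               - observed) ** 2)
                if mse < best_mse:
                    best_mse, best_shift = mse, (dy, dx)
        mse_total = np.mean((forecast - observed) ** 2)
        mse_displacement = mse_total - best_mse
        mse_volume = (forecast.mean() - observed.mean()) ** 2
        mse_pattern = best_mse - mse_volume
        return {"displacement": mse_displacement, "volume": mse_volume,
                "pattern": mse_pattern, "shift": best_shift}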
Anthes (1983) also suggested comparing the spectra of observed and forecast fields. Zepeda-Arce et al. (2000) provide a recent example of this approach, using wavelet transforms to compute the spatial variation of the rainfall field as a function of horizontal scale. They examine how the variance of the spatial fluctuations changes with scale for the observed and forecast fields, showing how well the forecast captures the spatial structure of the field. This technique provides information on the "climatology" of a forecast system, but no information on forecast accuracy. Furthermore, it provides no information on displacement or phase errors.
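A simple stand-in for this kind of scale analysis, substituting block averaging for the wavelet transform used by Zepeda-Arce et al. (2000), might look like:

    import numpy as np

    def variance_by_scale(field, factors=(1, 2, 4, 8, 16)):
        # Variance of the field after block-averaging to successively
        # coarser grids; comparing the curves for forecast and observed
        # fields shows how well spatial structure is captured by scale.
        ny, nx = field.shape
        out = {}
        for k in factors:
            m, n = (ny // k) * k, (nx // k) * k
            coarse = field[:m, :n].reshape(m // k, k, n // k, k).mean(axis=(1, 3))
            out[k] = float(coarse.var())
        return out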
Anthes (1983) also recommended the use of a correlation matrix scoring method (Tarbell et al. 1981). Although this may provide some information on phase or displacement errors, as well as on the spatial structure of the fields, the method cannot objectively determine whether the maximum correlation is the result of the same (or similar) meteorological phenomena. In addition, the presence of SHAF may cause substantial uncertainty in the determination of the phase or displacement error: returning to our idealized scenario, one would find several different local maxima in the lag-correlation field.
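To illustrate the ambiguity, consider a lag-correlation map between the forecast and observed fields (our sketch, computed with FFTs and periodic wraparound; not the formulation of Tarbell et al. 1981). When several SHAF are present, the map typically contains several comparable local maxima, so the diagnosed displacement is not unique:

    import numpy as np

    def lag_correlation_map(forecast, observed):
        # Correlation between the two fields at every 2D lag (circular
        # FFT version); the location of the maximum estimates the
        # displacement, and secondary maxima flag ambiguous matches.
        f = forecast - forecast.mean()
        o = observed - observed.mean()
        corr = np.real(np.fft.ifft2(np.fft.fft2(o) * np.conj(np.fft.fft2(f))))
        return np.fft.fftshift(corr) / (f.std() * o.std() * f.size)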
1.3.3 A new approach
Clearly, mesoscale verification efforts over the last twenty years or so have been only partially successful, a reflection of the numerous challenges associated with this problem. Our strategy for the development of new approaches is based heavily on this previous work and on the perceived needs of operational forecasters. Specifically, our working strategy is to classify or categorize the forecast and observed fields prior to verification. These fields are decomposed into subsets of small domains of a predetermined size, and the predominant meteorological phenomenon within each sub-domain is then classified using pattern-recognition techniques. Guidance in determining the domain size and identifying significant events for the objective classification system is obtained from operational forecasters and decision makers, ensuring that the results of the verification are tailored to and focused on the aspects of the forecast problems that are most critical to end users. Once events are identified within the observed and forecast fields, the joint probability of the forecasts and observations of particular events can be examined. There appears to be an opportunity to extend the general framework of the distributions-oriented verification approach to verifying the realism of forecasts.
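As a concrete illustration of the classification step, the sketch below tiles a field into fixed-size sub-domains and labels each with simple placeholder rules; in practice, the tile size, class definitions, and thresholds would come from forecaster guidance rather than the arbitrary values assumed here:

    import numpy as np

    def classify_subdomains(field, tile=32, shaf_amp=0.8, rain_cov=0.2):
        # Label the predominant phenomenon in each tile-sized sub-domain.
        # The class names and thresholds are illustrative placeholders.
        ny, nx = field.shape
        labels = {}
        for j in range(0, ny - tile + 1, tile):
            for i in range(0, nx - tile + 1, tile):
                sub = field[j:j + tile, i:i + tile]
                coverage = float(np.mean(sub > 0.0))
                if sub.max() >= shaf_amp:
                    labels[(j, i)] = "SHAF"
                elif coverage >= rain_cov:
                    labels[(j, i)] = "widespread light"
                elif coverage > 0.0:
                    labels[(j, i)] = "isolated light"
                else:
                    labels[(j, i)] = "none"
        return labels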
We call this approach, in which the joint distribution of the set of forecast events is compared to the set of observed events, an "events-oriented" approach to verification. Development work on this approach is continuing, focusing on an objective method for identifying and classifying events. This research follows the well-established techniques of knowledge discovery in databases and data mining, which are concerned with discovering patterns within large and complex sets of data. A large historical database, richly populated with a variety of interesting and important phenomena that span a large portion of the entire range of possible events, is being analyzed. Several methods of data reduction will be tested, including analysis of statistical distributions, cluster analysis, principal component analysis, and spectral/wavelet analysis.
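Given classified sub-domains for both fields, the events-oriented joint distribution can be tabulated directly. A minimal sketch, assuming the label dictionaries produced by the hypothetical classifier above:

    from collections import Counter

    def joint_distribution(forecast_labels, observed_labels):
        # Relative frequency of each (forecast class, observed class)
        # pair over matching sub-domains: the core table of an
        # events-oriented verification.
        pairs = Counter((forecast_labels[k], observed_labels[k])
                        for k in forecast_labels)
        total = sum(pairs.values())
        return {pair: count / total for pair, count in pairs.items()}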
SECTION 2: SUMMARY OF UNIVERSITY/NWS EXCHANGES
This project has directly or indirectly supported numerous activities that have cultivated a strong working relationship between the university community and the NWS.
SECTION 3: PRESENTATIONS AND PUBLICATIONS
Baldwin, M. E., and J. S. Kain, 1999: Verification issues: Obtaining useful information on mesoscale skill. Presented at NCEP, the World Weather Building, Camp Springs, MD, November 1999.
Baldwin, M. E., and J. S. Kain, 1999: Extending traditional QPF verification techniques to obtain information on mesoscale forecast skill. Presented at the American Geophysical Union Fall 1999 Meeting, December 17, 1999, San Francisco, CA.
Baldwin, M. E., and J. S. Kain, 2000: Evaluating and improving mesoscale components of QPF guidance. Presented at the USWRP Science Symposium, Boulder, CO, March 2000.
Baldwin, M. E., M. P. Kay, and J. S. Kain, 2000: Properties of the Convection Scheme in NCEP's ETA Model that Affect Forecast Soundings. Preprints, 20th Conference on Severe Local Storms, Orlando, FL, Amer. Meteor. Soc., 447-448.
Baldwin, M. E., S. Lakshmivarahan, J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, Ninth Conference on Mesoscale Processes, Amer. Meteor. Soc., Ft. Lauderdale, FL, July 30 - August 2, 2001.
Janish, P. R., S. J. Weiss, J. S. Kain, and M. E. Baldwin, 2001: Advancing operational forecasting through collaborative applied research programs at the Storm Prediction Center and National Severe Storms Laboratory. Preprints, 18th Conference on Weather Analysis and Forecasting, Amer. Meteor. Soc., Ft. Lauderdale, FL, July 30 - August 2, 2001.
Kain, J. S., and M. E. Baldwin, 2000: Parameterized Updraft Mass Flux as a Predictor of Convective Intensity. Preprints, 20th Conference on Severe Local Storms, Orlando, FL, Amer. Meteor. Soc., 449-452.
Kain, J. S., and M. E. Baldwin, 2000: Lessons Learned from Experimental Forecasting with the ETA Model at NSSL/SPC, presented at the NCAR/MMM Precipitation Workshop, Nov. 15, 2000, Boulder, CO.
Kain, J. S., M. E. Baldwin, P. R. Janish, and S. J. Weiss, 2001: Utilizing the ETA model with two different convective parameterizations to predict convective initiation and evolution at the SPC. Preprints, Ninth Conference on Mesoscale Processes, Amer. Meteor. Soc., Ft. Lauderdale, FL, July 30 - August 2, 2001.
Kain, J. S., and P.R. Janish, 2001: Advancing operational forecasting through collaborative applied research programs at the Storm Prediction Center and National Severe Storms Laboratory. Presented at the National Centers for Environmental Prediction, Camp Springs, MD, January 31, 2001.
Kain, J. S., and P.R. Janish, 2001: Advancing operational forecasting through collaborative applied research programs at the Storm Prediction Center and National Severe Storms Laboratory. Presented at the National Severe Storms Laboratory, Norman, OK, March 9, 2001.
Schwartz, B.E., S.J. Weiss, and S.G. Benjamin, 2000: An assessment of short-range forecast fields from the Rapid Update Cycle related to convective development. Preprints, 20th Conf. On Severe Local Storms, Amer. Meteor. Soc., Orlando, FL, 443-446.
Weiss, S.J., 2000: The SPC/NSSL spring program 2000: Evaluation of model forecasts of convective initiation and evolution. Presented at EMC Annual Review, Camp Springs, MD, December 12, 2000.
Weiss, S.J., J.S. Kain, M.E. Baldwin, J.A. Hart, and D.S. Stensrud, 2000: Some aspects of mesoscale model forecasts for the 3 May 1999 tornado outbreak. Presented at the National Symposium on the Great Plains Tornado Outbreak of 3 May 1999, Oklahoma City, May 4, 2000.
SECTION 4: SUMMARY OF BENEFITS AND PROBLEMS ENCOUNTERED
4.1 Summary of benefits to University of Oklahoma
This research benefits the University of Oklahoma's teaching and research missions in many ways.
4.2 Summary of benefits to SPC
This collaboration has resulted in numerous benefits to the SPC forecasting program, through enhanced operational/applied research interactions that have increased forecaster understanding of model physical processes and performance characteristics. This has led to the issuance of improved severe weather forecasts by SPC forecasters.
REFERENCES
Anthes, R. A., 1983: Regional models of the atmosphere in middle latitudes. Mon. Wea. Rev., 111, 1306-1335.
Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, Ninth Conference on Mesoscale Processes, Amer. Meteor. Soc., Ft. Lauderdale, FL, July 30 - August 2, 2001.
Brooks, H. E., and C. A. Doswell III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288-303.
Ebert, E.E. and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179-202.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Tarbell, T. C., T. T. Warner, and R. A. Anthes, 1981: An example of the initialization of the divergent wind component in a mesoscale numerical weather prediction model. Mon. Wea. Rev., 109, 77-95.
Williamson, D. L., 1981: Storm track representation and verification. Tellus, 33, 513-530.
Zepeda-Arce, J., E. Foufoula-Georgiou, and K. K. Droegemeier, 2000: Space-time rainfall organization and its role in validating quantitative precipitation forecasts. J. Geophys. Res., 105, 10129-10146.