Testing model performance during development of a model usually results in calibration, which is, according to Penning de Vries (Penning de Vries and van Laar, 1982), a "very restricted form of evaluation" and an "adjustment of some parameters such that model behavior matches one set of real world data." It can "degrade simulation into curve fitting."

Before any model can be used with confidence, adequate validation, or an assessment of the magnitude of the errors that may result from its use, should be performed. Model validation, in its simplest form, is a comparison between simulated and observed values.

Beyond comparisons, there are several statistical measures available to evaluate the association between predicted and observed values, among them the correlation coefficient (r) and its square, the coefficient of determination (r2). Willmott (1982) has pointed out that the main problem with this analysis is that the magnitudes of r and r2 are not consistently related to the accuracy of prediction where accuracy is defined as the degree to which model predictions approach the magnitudes of their observed counterparts.

Test criteria have been separated into two groups, called summary measures and difference measures. Summary measures include the mean of observed values (O) and predicted values (P), the standard deviations of the observations (So) and the predictions (Sp), and the intercept (a) and slope (b) of the least-squares regression:

Pi = a + b * Oi
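The summary measures can be computed directly from paired samples. The following is a minimal sketch in Python; the function and variable names are illustrative and not from the CERES-Wheat source code:

```python
def summary_measures(obs, pred):
    """Return (O_mean, P_mean, So, Sp, a, b) for paired samples,
    where P = a + b*O is the least-squares regression of
    predicted on observed values."""
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    # Sample standard deviations of observations and predictions
    so = (sum((o - o_mean) ** 2 for o in obs) / (n - 1)) ** 0.5
    sp = (sum((p - p_mean) ** 2 for p in pred) / (n - 1)) ** 0.5
    # Least-squares regression of predicted on observed: P = a + b*O
    sxx = sum((o - o_mean) ** 2 for o in obs)
    sxy = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
    b = sxy / sxx
    a = p_mean - b * o_mean
    return o_mean, p_mean, so, sp, a, b
```

For a perfect model, a approaches 0 and b approaches 1, with the two means and the two standard deviations coinciding.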

In addition, an index of agreement (d) (Willmott, 1982) was calculated as follows:

d = 1 - [ Σ (Pi - Oi)² / Σ (|P'i| + |O'i|)² ],    0 ≤ d ≤ 1

where the sums run from i = 1 to n, P'i = Pi - Ō, O'i = Oi - Ō, and Ō is the mean of the observed values.

Though (d) is intended mainly for determining the relative superiority of alternative models, it can also serve as a descriptive parameter of model performance: the closer (d) is to 1, the more accurate the model.
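The index of agreement defined above can be sketched as a short Python function (names are illustrative, not from the model code):

```python
def index_of_agreement(obs, pred):
    """Willmott's (1982) index of agreement d, with 0 <= d <= 1.

    d = 1 - sum((Pi - Oi)^2) / sum((|Pi - Obar| + |Oi - Obar|)^2)
    """
    o_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for p, o in zip(pred, obs))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
              for p, o in zip(pred, obs))
    return 1.0 - num / den
```

A perfect set of predictions yields d = 1, since the numerator vanishes.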

While summary measures describe the quality of the simulation, difference measures try to locate and quantify errors. The latter include the mean absolute error (MAE), the mean bias error (MBE), and the root mean square error (RMSE). All are calculated according to Willmott (1982) and are based on the term (Pi - Oi):

A) Mean Absolute Error (MAE):

   MAE = Σ |Pi - Oi| / n

B) Mean Bias Error (MBE):

   MBE = Σ (Pi - Oi) / n

C) Root Mean Square Error (RMSE):

   RMSE = [ Σ (Pi - Oi)² / n ]^(1/2)

with all sums running from i = 1 to n.


MAE and RMSE indicate the magnitude of the average error but provide no information on the relative size of the average difference between (P) and (O). MBE describes the direction of the error bias; its value, however, depends on the magnitude of the values under investigation. A negative MBE occurs when predictions are smaller than observations.
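The three difference measures can be computed together; a minimal Python sketch (function names are illustrative):

```python
def difference_measures(obs, pred):
    """Return (MAE, MBE, RMSE), all based on the term (Pi - Oi)."""
    n = len(obs)
    diffs = [p - o for p, o in zip(pred, obs)]
    mae = sum(abs(d) for d in diffs) / n          # mean absolute error
    mbe = sum(diffs) / n                          # mean bias error (signed)
    rmse = (sum(d * d for d in diffs) / n) ** 0.5 # root mean square error
    return mae, mbe, rmse
```

Note that MAE ≤ RMSE always, and MBE can be small even when individual errors are large, since positive and negative differences cancel.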

In the test of data sets without N routines, attention was focused on the unsystematic root mean square error (RMSEu). Following Willmott (1982), RMSE can be partitioned into a systematic component, RMSEs = [ Σ (P̂i - Oi)² / n ]^(1/2), and an unsystematic component, RMSEu = [ Σ (Pi - P̂i)² / n ]^(1/2), where P̂i = a + b*Oi from the least-squares regression, so that RMSE² = RMSEs² + RMSEu². For a good model, the unsystematic error RMSEu should approach RMSE, with RMSEs approaching 0, which is what could be observed.
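Following Willmott (1982), the systematic and unsystematic RMSE components are obtained from the least-squares fit of predicted on observed values. A minimal sketch (names are illustrative):

```python
def rmse_decomposition(obs, pred):
    """Return (RMSEs, RMSEu), the systematic and unsystematic
    components of RMSE (Willmott, 1982).

    P_hat_i = a + b*O_i from the regression of P on O, and the
    components satisfy RMSE^2 = RMSEs^2 + RMSEu^2.
    """
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    # Least-squares fit of predicted on observed
    b = (sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
         / sum((o - o_mean) ** 2 for o in obs))
    a = p_mean - b * o_mean
    p_hat = [a + b * o for o in obs]
    rmse_s = (sum((ph - o) ** 2 for ph, o in zip(p_hat, obs)) / n) ** 0.5
    rmse_u = (sum((p - ph) ** 2 for p, ph in zip(pred, p_hat)) / n) ** 0.5
    return rmse_s, rmse_u
```

The identity RMSE² = RMSEs² + RMSEu² holds exactly because the regression residuals are orthogonal to the fitted line.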

In addition to these basic tests, a number of procedures were tested using the data sets with N routines only. These are:

o A simple regression technique, suggested by Dent and Black (1979), combined with an F-test to evaluate simultaneously the null hypothesis that the intercept and slope do not differ from zero and unity, respectively.

o A statistic to determine model accuracy as defined by Freese (1960).

o A 5 percent critical error as defined by Reynolds (1984).

For results of the above mentioned tests, see Tables 1 and 2.

The parameters examined in the statistical evaluation were:

1. Anthesis date

2. Maturity date

3. Leaf area index (LAI) at maximum

4. Total above ground dry matter at maturity

5. Grain yield

6. Individual grain weight at maturity

7. Number of grains per m2 at maturity

8. Dry matter at anthesis

9. N uptake by the crop at anthesis

10. N uptake in the above ground plant parts at maturity

11. N content of the grain

12. Grain protein percentages

Parameters 1 through 7 were evaluated without using N routines (see Tables 3 and 4), whereas parameters 4 through 12 specially referred to aspects of the N routines to be tested (see Tables 1 and 2).

Table 1. Summary Measures for Data Sets with N Routines

Variable         Units     N      O       P      So     Sp       a      b     R     d
Dry matter       kg/ha    222  10,313  11,719  3,375  3,897   189.3  1.118  0.82  0.86
Grain yield      kg/ha    240   3,953   4,227  1,716  1,719   145.6  1.033  0.84  0.91
Total N uptake   kg N/ha  223     110     121     47     45     7.0  1.042  0.72  0.83
Grain N uptake   kg N/ha  215      84      87     34     31     4.4  0.996  0.74  0.85
Grain protein    percent  215    12.3    11.7   2.95   3.55   0.198  0.939  0.55  0.74
GPSM             no.      152  12,381  11,861  4,668  5,324     362  0.929  0.63  0.78
Kernel weight    mg       134    33.6    37.9   7.02   8.40   0.998  1.096  0.34  0.59
App recovery     percent  137    45.7    43.4   29.4  25.68    8.37  0.767  0.37  0.65

Table 2. Difference Measures for Data Sets with N Routines

Variable         Units     N      MAE      MBE     RMSE     RT     E*
Dry matter       kg/ha    222  2,248.1  1,405.6  2,648.2  13.06  480.9
Grain yield      kg/ha    240    806.1    274.4  1,024.2  15.39  186.5
Total N uptake   kg N/ha  223     28.6     11.5     36.3  22.43   65.9
Grain N uptake   kg N/ha  215     19.4      3.5     24.0  18.64   43.6
Grain protein    percent  215      2.4     -0.6      3.2  18.65    5.8
GPSM             no.      152  3,462.3   -520.8  4,355.4  22.43  778.6
Kernel weight    mg       140      8.1      4.2      9.9  12.00   17.6
App recovery     percent  137     23.5     -2.3     31.2  69.48   55.5

Note: N    = Number of observations.
      O    = Mean of observations.
      P    = Mean of predictions.
      So   = Standard deviation of observations.
      Sp   = Standard deviation of predictions.
      a    = Intercept term from regression of predicted on observed.
      b    = Slope term from regression of predicted on observed.
      R    = Correlation coefficient.
      d    = Index of agreement (Willmott, 1982).
      MAE  = Mean absolute error (Willmott, 1982).
      MBE  = Mean bias error (Willmott, 1982).
      RMSE = Root mean square error.
      RT   = Model accuracy (Freese, 1960).
      E*   = Five percent critical error as defined by Reynolds (1984).
      GPSM = Grains per m2.
      LAI  = Leaf area index.


Table 3. Summary Measures for Data Sets Without N Routines (regression: P = a + b*O)

Variable    Unit       n       O         P        So       Sp        a       b      R      d
Anthesis    days       82     144.6     144.2     30.3     28.4     14.6  0.8960  0.912
Phys mat    days       76     176.0     176.6     28.4     28.5      4.7  0.9763  0.947
Yield       kg/ha     157   5,547.0   5,646.8  2,219.0  2,578.2    582.2  0.9166  0.633  0.88
Grain wt    mg/kernel 144      38.3      36.4      9.3      9.4     13.4  0.6234  0.448
GPSM        #/m2      138  14,594.6  15,161.2  5,115.2  5,811.9  5,507.3  0.6715  0.376
Dry matter  g/m2       76   1,179.2   1,559.1    525.5    412.2    791.8  0.5052  0.411
LAI         --         54       4.4       5.2      2.6      1.6      3.3  0.4342  0.260

Table 4. Difference Measures for Data Sets Without N Routines

Variable    Unit        n    MBE      MAE     Max Error  Min Error    RMSEu
Anthesis    days       82   -0.4      5.67       32.0        0          8.5
Phys mat    days       76    0.6      5.1        18.0        0          6.6
Yield       kg/ha     157   99.8  1,272.8     3,468.0        6      1,552.0
Grain wt    mg/kernel 144   -1.9      5.8        28.4        0          6.5
GPSM        #/m2      138  566.6  3,444.1    17,446.0       17      4,437.5
Dry matter  g/m2       76  379.9    352.0     1,324.0        2        320.0
LAI         --         54    0.8      1.8         5.7        0          1.9


Since model development and testing is an iterative process in the early stages, most of the data sets have been utilized during the development phase and are not truly independent. Since development of CERES-Wheat-N has necessarily lagged behind that of CERES-Wheat, the opportunity to rigorously test the model with a large base of truly independent data sets has not yet arisen. Development and testing of the CERES-Maize-N model (Jones and Kiniry, 1986) has paralleled the work on the wheat model. Because the soil N transformation components of both models are identical, and because the basic structure of the CERES-Wheat model dictates the nature of biomass production, yield component determination, and water balance, the testing data base can be regarded as sufficiently independent of model development. Other data sets that have been used extensively in model development and calibration have not been included in the analysis.

To test the CERES-Wheat model under different growing conditions, a data base was assembled to represent a diversity of environments, including short growing season spring wheat crops, environments with limited water availability, sub-tropical wheat growing areas with little vernalization, and regions with temperature extremes. The data base represents a time span of 25 years (1959 - 1984) and a range of latitudes from 54° N (West Germany) to 36° S (Australia) (see Fig. 8.1). It includes published experimental data as well as unpublished dissertations and other unpublished sources.

Fig. 8.1. Location of experimental sites.

Because a complete minimum data set was rarely available, the additional climatic and soils information was obtained mainly through personal communication. When a few key data were unavailable, certain model inputs were estimated using the best available information and the approaches described in Chapter 5.

A good data set should contain as much detail as possible to describe the process of plant growth. However, taking a sufficient number of samples throughout the season is time consuming and expensive and was therefore done only in experiments especially laid out for model testing. These were particularly valuable because their results allowed for the description of the time course of several aspects of plant growth and comparison with model output.

For a detailed description of individual data sets see Otter-Nacke, Godwin, and Ritchie (1986).


Phasic Development

The accuracy of simulating the phasic development of a crop is crucial for accurate simulation of crop growth and yield; evaluation of phasic development should therefore be the first step. Observed and predicted means are close together, with small RMSEs and similar standard deviations. While the intercept (a) is relatively close to zero and the slope (b) relatively close to unity for physiological maturity, this is not the case for anthesis date (see Tables 3 and 4).

Grain Yield and Components

Predicted versus observed yields are depicted in Figs. 8.2 and 8.3. Most predictions are within the limits of ±1.0 standard deviation; in only 30 out of 240 and 24 out of 157 cases was the difference between observed and predicted yield larger than one standard deviation. The index of agreement (d) summarizes the differences in all of these statistical parameters for runs with and without N applications accounted for. The d-values amount to 0.91 and 0.88, respectively, which indicates that the model including the N routines is more accurate than runs without them. This is not a rigorous test, however, because the runs were performed on different data sets. CERES-Wheat tends to over predict yield. Means and standard deviations of yields predicted using or bypassing the N routines approach those of the observations (Tables 1 and 3).
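The counts of points falling outside the standard deviation band can be reproduced with a short routine. This is a sketch under stated assumptions: per the figure captions, the band is taken as ±1 residual standard deviation about the least-squares regression of predicted on observed; the function name is illustrative.

```python
def points_outside_band(obs, pred):
    """Count pairs whose residual from the regression of predicted
    on observed exceeds one standard deviation of those residuals.

    Assumption: the figures' +/- 1 SD band is drawn about the
    regression line, as the figure captions state.
    """
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    # Least-squares fit P = a + b*O
    b = (sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
         / sum((o - o_mean) ** 2 for o in obs))
    a = p_mean - b * o_mean
    resid = [p - (a + b * o) for o, p in zip(obs, pred)]
    # Residual standard deviation (n - 2 degrees of freedom)
    s = (sum(e * e for e in resid) / (n - 2)) ** 0.5
    return sum(1 for e in resid if abs(e) > s)
```

A perfect linear relationship between predictions and observations gives zero points outside the band.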

Fig. 8.2. Grain yield. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundaries of ±1 standard deviation from the regression line.

Fig. 8.3 Predicted versus measured yield.

Kernel weight and grain number per m2 are the two components of grain yield. Performance of the model in predicting these parameters was in general poorer than the simulation of grain yield. MBE indicates that the model over estimates grain weight under fertilized conditions (Table 2), whereas with the N routines switched off, the negative MBE suggests the contrary (Table 4). This is surprising because N is assumed to be non-limiting in the latter case. The intercepts and slopes of the corresponding regressions are likely to be significantly different from zero and unity (Tables 1 and 3).

This is also obvious for the regression of predicted on observed number of kernels per m2 without accounting for N effects (Fig. 8.4). Though the regression line is closer to the 1:1 line when testing the N application data sets (Fig. 8.5), there is still considerable scatter outside the standard deviation limits. The error biases of kernel number and kernel weight alternate in sign, suggesting that the two genetic parameters G1 and G2 (responsible for the determination of kernel number and kernel weight) are estimated poorly when required to accommodate data sets of different origin. MAE and RMSE are of similar value.

Fig. 8.4. Predicted versus measured number of grains.

Fig. 8.5. Grains /m**2. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundaries of ±1 standard deviation from the regression line.

Dry Matter and LAI

More scatter around the 1:1 line occurred for biomass predictions than for grain yield (Figs. 8.6 and 8.7). Means and standard deviations of predicted and observed dry matter differ more in the data sets in which N was not accounted for.

Fig. 8.6. Dry matter. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundaries of ±1 standard deviation from the regression line.

Fig. 8.7. Predicted versus measured above ground dry matter.

MBEs are positive, indicating a tendency to over predict dry matter production. Other coefficients show the model simulation is acceptable over a range of total dry matter yields from 448 to 3,389 g/m2. The slope of the regression line for N trials (Fig. 8.6) supports the observation of over prediction. In the N non-limiting sets, the slope is only 0.5052, mainly due to a few outliers at the high end of the range where dry matter was strongly under predicted.

Only summary and difference measures are given for leaf area index (LAI) (Tables 3 and 4), because the error associated with measured values of LAI is usually very high. In this particular case, maximum LAI was chosen for the comparison between predicted and observed values. LAI measurements, however, being time consuming, were often not taken frequently enough to capture the peak of leaf area development. The MBE of 0.8 indicates that the model over predicts LAI by 18 percent on average. Standard deviations are quite different, and intercept and slope are clearly different from zero and unity. Under the above premises, however, these statistical values are not particularly meaningful.

Total N uptake, grain N uptake, and grain protein values are of special interest when the nitrogen switch in the model is turned on. Simulation of these parameters is related to yield simulation but was not as accurate as the yield simulation. Forty-four points out of a total of 223 fall outside the bounds of ±1.0 standard deviation of the 1:1 line for total N uptake (Fig. 8.8). For grain N uptake and grain protein percentage (Figs. 8.9 and 8.10) the figures were 31 out of 215 and 60 out of 215, respectively. Intercept and slope are just beyond the 5 percent confidence intervals, whereas the other statistics suggest the simulations are acceptable. Total N uptake and grain N uptake were slightly over predicted.

Fig. 8.8. Predicted versus observed N uptake at maturity.

Fig. 8.9. Predicted versus observed grain N uptake.

Fig. 8.10. Predicted versus observed grain protein percentages.

Much of the error involved in the simulation of total N uptake was related to poor simulation of N concentrations in the straw at harvest, because grain N uptake was simulated fairly closely. Different harvesting techniques can introduce discrepancies, e.g., when substantial amounts of chaff or leaf material are included in the sample. The model does not account for losses of N through either leaching from harvest-ripe straw or volatile losses from senescing leaves.

The scatter of points around the 1:1 line was higher for grain protein than for total and grain N uptake (Fig. 8.10). Only the modified Freese statistic indicates the model is still acceptable (Table 2). Simulation of grain protein content has been, to date, one of the most difficult components in the model. Determining whether a genetic factor controls some of the variation in grain N contents is difficult. Introduction of an additional genetic coefficient has so far been avoided.

For further details of the performance of aspects of the N routines see Godwin (1987).


Sowing Density

Response curves to major input variables are a pictorial way to demonstrate the model's ability to react correctly to varying inputs. Sowing density commonly varies with site and management system. Climatic conditions often do not allow for high densities, because stored soil water plus rainfall are not sufficient to support a large number of plants per unit area of land.

Lelystad is a site in the Netherlands where water is not a limiting factor. Performance of crops planted at densities ranging from five to 800 plants/m2 was compared, with a focus on the tillering process. Fig. 8.11 shows observed and simulated yields in response to increasing sowing density, demonstrating that even in this environment the optimal sowing density is between 100 and 200 plants/m2.

Fig. 8.11 Lelystad 1977. Observed and simulated yields in response to increasing sowing density.

The model provides a very similar response. However, the difference between observed and simulated yields is 2,238 kg/ha at the lowest sowing density of five plants/m2 and thus exceeds the ±1 standard deviation limit. In absolute terms this seems to be a large error; in the context of the entire experiment, however, it appears to be acceptable.


Irrigation

Comparisons of predicted and observed grain yields at 120 kg N/ha under different irrigation schedules are shown for three varieties at the arid wheat growing site Tel Hadya, Syria (Fig. 8.12). An increasing response to fertilizer with increasing amounts of irrigation was both observed and simulated. Varietal differences were larger in the simulations than in reality. The model consistently over estimated the N response of the variety Sonalinka and the effect of little or no irrigation on the variety Mexipak.

N Application

Out of a number of response curves of grain yield to differing N applications, Fig. 8.13 demonstrates the capability of the CERES-Wheat model to simulate location-specific responses to N fertilizer application. Over a range of 0 to 210 kg N applied per ha, the model provided grain yields very close to those measured at Rothamsted in 1975. The optimum application rate and the differences between varieties were also reflected in these response curves.

At another location, Flevopolder, The Netherlands, the model showed no response over N rates ranging from 50 to 200 kg/ha, although a 25 percent increase in yield was measured (Fig. 8.13b).

Fig. 8.12. Comparison of predicted and observed grain yields at differing fertilizer rates for three varieties with different irrigation strategies at Tel Hadya Syria, 1980.

Fig. 8.13. Comparison of predicted and observed grain yield response to applied N in individual data sets at Rothamsted, United Kingdom, 1975.

Fig. 8.13b. Comparison of predicted and observed grain yield response to applied N in individual data sets at Flevopolder, Netherlands, 1975.

The amount of fertilizer applied and the timing of split applications may result in different grain yields. For three locations, Lancelin (Western Australia), Bozeman (Montana), and Wageningen (Netherlands), simulated and observed yields are compared for five to seven different fertilizer strategies (Fig. 8.14). The response pattern was reflected well in the model simulations. In cases where simulated yields did not coincide with observations, yield was over estimated to varying degrees.


Tracing the seasonal pattern of different aspects of plant growth, such as the growth of different plant parts, LAI, or tiller development, and checking it against real world observations is another way to locate errors in the simulation process and thus to ensure accuracy. A few examples of experimental data used for this purpose follow.

Fig. 8.14. Comparison of predicted and observed grain yield response to differing fertilizer split application patterns.

Dry Matter

The seasonal pattern of dry matter has been checked intensively. A representative example of a very good match comes from a crop managed like a farmer's field at Weihenstephan, West Germany, 1983 (Fig. 8.15). Two contrasting varieties, Caribo and Maris Mardler, the latter more adapted to a cool, humid climate, performed well and produced matching yields. The timing and amount of biomass production were in accord with observations.

Fig. 8.15. CERES-Wheat (non-nitrogen) model validation results from individual experiments; predicted versus observed at Weihenstephan, West Germany 1983.

Dry Matter Partitioning

Several experiments have been conducted to evaluate dry matter partitioning between plant parts over the course of the growing season, e.g., in Nottingham, 1975. Data from this experiment were compared to model values (Fig. 8.16). LAI and root weight were over estimated by the model, resulting in slightly higher than observed total dry matter values. Partitioning plants into leaves, stems, and ears, which are sinks at different stages of the plant life cycle, illustrates which parts may not have been modelled satisfactorily. In this case, maximum predicted leaf weight was about 400 g per m2 higher than observed. This fully accounts for the over estimated dry matter, because stem and ear weights were in perfect accord with the observations in the experiment.

A different situation was found in a 1971 experiment in Rutherglen, Australia, where measurements were taken on a biweekly basis (Fig. 8.17). Ear weight was simulated well, but stem weights were missed considerably. This could be related to a different definition of plant parts, such as leaf sheaths being included in either the stem weight or the leaf weight. The amount of assimilate partitioned to the roots was obviously too high over the whole growing season.

Leaf Area Index (LAI)

Fig. 8.18 compares model-produced LAI values with observed LAI in an experiment in Roodeplaat, South Africa, where varying amounts of irrigation water were applied at different times. Plots receiving higher amounts of water (irrigations 3 and 5) produced denser canopies, which could be simulated well. In irrigations 2 and 4, the timing of LAI build-up was correct, but the simulated LAI was too small. Irrigation 1 resulted in a different modelled leaf area development, where the timing was later than observed in the field and maximum LAI was close to 4 rather than the 2.5 measured in the experiment.

Fig. 8.16. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Nottingham, England, 1975; predicted versus observed.

Fig. 8.16. Continued.

Fig. 8.17. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Rutherglen, Australia, 1971; predicted versus observed.

Fig. 8.17. Continued.

Fig. 8.18. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Roodeplaat, South Africa, 1978; predicted versus observed.

Fig. 8.18. Continued.

Tiller Development

Modeling tiller development satisfactorily has been one of the more difficult tasks, because this process is controlled by a set of environmental, inter-plant, and intra-plant conditions. The effect of sowing density on tiller development under arid conditions was monitored in a 1984 experiment in Temple, Texas (Fig. 8.19). The normal shape of a tillering curve was observed at the common sowing density of 160 plants/m2, and the model was able to simulate this curve perfectly. At 40 plants/m2, the model missed the peak of tiller development observed in the field at ca. 1,200 tillers/m2. At the lowest sowing density of 10 plants/m2, the model maintained the number of tillers formed early in the season until harvest, whereas some degree of tiller abortion was observed in the field.

The first measurement in the highest density plot was taken about 90 days after sowing, when the tiller number for this dense crop was at its maximum and some plants had up to 40 tillers. Some error could be associated with the sampling technique, so missing this data point is not considered crucial. For the rest of the season there was good agreement between modelled and observed tiller numbers, with the final number of tillers being correct.

Fig. 8.19. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Temple, Texas, 1984; predicted versus observed.


The CERES-Wheat model is designed to be used as a management-oriented simulation model for a diversity of applications in all environments suitable for wheat growing. To make the model useful to as wide an audience as possible, the inputs must be minimal, and they must be reasonably easy to obtain or estimate from standard agricultural practice. Under these premises, the model simulates crop growth and development and the responses to N fertilizer, irrigation, sowing time, and sowing density reasonably reliably. Rigorous statistical analysis indicated that the model simulations are acceptable. However, some problems have yet to be resolved with the prediction of tiller development, N uptake, and grain protein concentration, as suggested by comparison with a few data sets. A slight tendency to over predict biomass and N uptake early in the season was noted. Further data sets are required to test the soil N components of the model more rigorously. Additional testing and refinement of the indicated parts of the model may be beneficial.