**VALIDATION PROCEDURES**

Testing model performance during development of a model usually results in calibration, which is, according to Penning de Vries (Penning de Vries and van Laar, 1982), a "very restricted form of evaluation" and an "adjustment of some parameters such that model behavior matches one set of real world data." It can "degrade simulation into curve fitting."

Before any model can be used with confidence, adequate validation, or assessment of the magnitude of the errors that may result from its use, should be performed. Model validation, in its simplest form, is a comparison between simulated and observed values.

Beyond such comparisons, several statistical measures are available to evaluate the association between predicted and observed values, among them the correlation coefficient (r) and its square, the coefficient of determination (r^{2}). Willmott (1982) has pointed out that the main problem with this analysis is that the magnitudes of r and r^{2} are not consistently related to the accuracy of prediction, where accuracy is defined as the degree to which model predictions approach the magnitudes of their observed counterparts.

Test criteria have been separated into two groups: summary measures and difference measures. Summary measures include the mean of observed values (Ō) and predicted values (P̄), the standard deviations of the observations (S_{o}) and the predictions (S_{p}), and the intercept (a) and slope (b) of the least-squares regression of predicted on observed:

P = a + b·O
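As an illustration, these summary measures (means, standard deviations, and the regression P = a + b·O of predicted on observed) can be computed from paired data. This is a minimal sketch; the function and variable names are illustrative, not taken from the original model code.

```python
import statistics

def summary_measures(obs, pred):
    """Summary measures: means, standard deviations, and the least-squares
    regression of predicted on observed, P = a + b*O."""
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    s_o = statistics.stdev(obs)   # sample standard deviation of observations
    s_p = statistics.stdev(pred)  # sample standard deviation of predictions
    # slope (b) and intercept (a) of the regression of P on O
    b = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred)) / sum(
        (o - o_mean) ** 2 for o in obs)
    a = p_mean - b * o_mean
    return {"O": o_mean, "P": p_mean, "S_o": s_o, "S_p": s_p, "a": a, "b": b}
```

For a perfect model, a approaches 0 and b approaches 1, with the two means and the two standard deviations coinciding.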

In addition, an index of agreement (d) (Willmott, 1982) was calculated as follows:

d = 1 - [ Σ_{i=1}^{n} (P_{i} - O_{i})^{2} / Σ_{i=1}^{n} (|P'_{i}| + |O'_{i}|)^{2} ],   0 ≤ d ≤ 1

where P'_{i} = P_{i} - Ō and O'_{i} = O_{i} - Ō, with Ō denoting the mean of the observed values.

Though d is meant mainly to determine the relative superiority of alternative models, it can also serve as a descriptive measure of model performance: the closer d is to 1, the more accurate the model.
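Willmott's index of agreement can be computed directly from the paired values. A minimal sketch (names are illustrative):

```python
def index_of_agreement(obs, pred):
    """Willmott's (1982) index of agreement d, bounded on [0, 1].

    d = 1 - sum((P_i - O_i)^2) / sum((|P_i - Obar| + |O_i - Obar|)^2)
    """
    o_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den
```

A perfect simulation yields d = 1; the denominator scales the squared errors by the largest disagreement possible given the observed mean.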

While summary measures describe the quality of the simulation, difference measures try to locate and quantify errors. The latter include the mean absolute error (MAE), the mean bias error (MBE), and the root mean square error (RMSE). All are calculated according to Willmott (1982) and are based on the term (P_{i} - O_{i}):

A) Mean Absolute Error (MAE):

MAE = Σ_{i=1}^{n} |P_{i} - O_{i}| / n

B) Mean Bias Error (MBE):

MBE = Σ_{i=1}^{n} (P_{i} - O_{i}) / n

C) Root Mean Square Error (RMSE):

RMSE = [ Σ_{i=1}^{n} (P_{i} - O_{i})^{2} / n ]^{1/2}

MAE and RMSE indicate the magnitude of the average error but provide no information on the relative size of the average difference between P and O. MBE describes the direction of the error bias; its value, however, depends on the magnitude of the values under investigation. A negative MBE occurs when predictions are smaller in value than observations.
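The three difference measures can be computed in a few lines. A minimal sketch, again with illustrative names:

```python
import math

def difference_measures(obs, pred):
    """MAE, MBE, and RMSE per Willmott (1982), based on (P_i - O_i)."""
    n = len(obs)
    diffs = [p - o for o, p in zip(obs, pred)]
    mae = sum(abs(d) for d in diffs) / n        # magnitude of average error
    mbe = sum(diffs) / n                        # negative => under-prediction
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mae, mbe, rmse
```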

In the tests of data sets without N routines, attention was focused on the unsystematic error (RMSE_{u}), obtained by partitioning the RMSE into systematic and unsystematic components:

RMSE^{2} = RMSE_{s}^{2} + RMSE_{u}^{2}

For a good model, the unsystematic error RMSE_{u} should approach the RMSE, with RMSE_{s} approaching 0, which is what could be observed.
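This partition is obtained by regressing the predictions on the observations: the systematic part is the error of the fitted line against the observations, the unsystematic part is the scatter of the predictions about that line. A minimal Python sketch (function and variable names are illustrative, not from the original model code):

```python
import math

def rmse_components(obs, pred):
    """Systematic and unsystematic RMSE (after Willmott, 1982).

    P_hat is the least-squares estimate of P from O; the identity
    RMSE^2 = RMSE_s^2 + RMSE_u^2 holds by construction.
    """
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    b = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred)) / sum(
        (o - o_mean) ** 2 for o in obs)
    a = p_mean - b * o_mean
    p_hat = [a + b * o for o in obs]
    rmse_s = math.sqrt(sum((ph - o) ** 2 for ph, o in zip(p_hat, obs)) / n)
    rmse_u = math.sqrt(sum((p - ph) ** 2 for p, ph in zip(pred, p_hat)) / n)
    return rmse_s, rmse_u
```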

In addition to these basic tests, a number of procedures were tested using the data sets with N routines only. These are:

o A simple regression technique, suggested by Dent and Blackie (1979), combined with an F-test to evaluate the simultaneous null hypothesis that the intercept and slope do not differ from zero and unity, respectively.

o A statistic to determine model accuracy as defined by Freese (1960).

o A 5 percent critical error as defined by Reynolds (1984).
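As a sketch of the first of these procedures, the joint F-test on the regression of predicted on observed can be written as follows. This is the standard simultaneous test of intercept = 0 and slope = 1, shown purely as an illustration of the idea; the statistic is compared with the tabulated F value for (2, n - 2) degrees of freedom, and all names are illustrative.

```python
def joint_f_test(obs, pred):
    """F statistic for the simultaneous null hypothesis that the intercept
    is 0 and the slope is 1 in the regression P = a + b*O."""
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    sxx = sum((o - o_mean) ** 2 for o in obs)
    b = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred)) / sxx
    a = p_mean - b * o_mean
    # residual mean square about the fitted line (n - 2 df)
    sse = sum((p - (a + b * o)) ** 2 for o, p in zip(obs, pred))
    s2 = sse / (n - 2)
    # quadratic form (beta - beta0)' X'X (beta - beta0) with beta0 = (0, 1);
    # this equals sum((a + b*O_i - O_i)^2), the squared departure from 1:1
    q = (n * a * a
         + 2 * a * (b - 1) * sum(obs)
         + (b - 1) ** 2 * sum(o * o for o in obs))
    return (q / 2) / s2  # compare with F(2, n - 2)
```

A small F value means the fitted regression line is statistically indistinguishable from the 1:1 line.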

For results of the above mentioned tests, see Tables 1 and 2.

The parameters examined in the statistical evaluation were:

1. Anthesis date

2. Maturity date

3. Leaf area index (LAI) at maximum

4. Total above ground dry matter at maturity

5. Grain yield

6. Individual grain weight at maturity

7. Number of grains per m^{2} at maturity

8. Dry matter at anthesis

9. N uptake by the crop at anthesis

10. N uptake in the above ground plant parts at maturity

11. N content of the grain

12. Grain protein percentages

Parameters 1 through 7 were evaluated without using the N routines (see Tables 3 and 4), whereas parameters 4 through 12 specifically refer to aspects of the N routines to be tested (see Tables 1 and 2).

Table 1. Summary Measures for Data Sets with N Routines

| Variable | Units | N | O | P | S_o | S_p | a | b | R | d |
|---|---|---|---|---|---|---|---|---|---|---|
| Dry matter | kg/ha | 222 | 10,313 | 11,719 | 3,375 | 3,897 | 189.3 | 1.118 | 0.82 | 0.86 |
| Grain yield | kg/ha | 240 | 3,953 | 4,227 | 1,716 | 1,719 | 145.6 | 1.033 | 0.84 | 0.91 |
| Total N uptake | kg N/ha | 223 | 110 | 121 | 47 | 45 | 7.0 | 1.042 | 0.72 | 0.83 |
| Grain N uptake | kg N/ha | 215 | 84 | 87 | 34 | 31 | 4.4 | 0.996 | 0.74 | 0.85 |
| Grain protein | percent | 215 | 12.3 | 11.7 | 2.95 | 3.55 | 0.198 | 0.939 | 0.55 | 0.74 |
| GPSM | no. | 152 | 12,381 | 11,861 | 4,668 | 5,324 | 362 | 0.929 | 0.63 | 0.78 |
| Kernel weight | mg | 134 | 33.6 | 37.9 | 7.02 | 8.40 | 0.998 | 1.096 | 0.34 | 0.59 |
| App recovery | percent | 137 | 45.7 | 43.4 | 29.4 | 25.68 | 8.37 | 0.767 | 0.37 | 0.65 |

Table 2. Difference Measures for Data Sets with N Routines

| Variable | Units | N | MAE | MBE | RMSE | RT | E^{*} |
|---|---|---|---|---|---|---|---|
| Dry matter | kg/ha | 222 | 2,248.1 | 1,405.6 | 2,648.2 | 13.06 | 480.9 |
| Grain yield | kg/ha | 240 | 806.1 | 274.4 | 1,024.2 | 15.39 | 186.5 |
| Total N uptake | kg N/ha | 223 | 28.6 | 11.5 | 36.3 | 22.43 | 65.9 |
| Grain N uptake | kg N/ha | 215 | 19.4 | 3.5 | 24.0 | 18.64 | 43.6 |
| Grain protein | percent | 215 | 2.4 | -0.6 | 3.2 | 18.65 | 5.8 |
| GPSM | no. | 152 | 3,462.3 | -520.8 | 4,355.4 | 22.43 | 778.6 |
| Kernel weight | mg | 140 | 8.1 | 4.2 | 9.9 | 12.00 | 17.6 |
| App recovery | percent | 137 | 23.5 | -2.3 | 31.2 | 69.48 | 55.5 |

Note: N = Number of observations.
O = Mean of observations.
P = Mean of predictions.
S_o = Standard deviation of observations.
S_p = Standard deviation of predictions.
a = Intercept term from regression of predicted on observed.
b = Slope term from regression of predicted on observed.
R = Correlation coefficient between predicted and observed.
d = Index of agreement (Willmott, 1982).
MAE = Mean absolute error (Willmott, 1982).
MBE = Mean bias error (Willmott, 1982).
RMSE = Root mean square error.
RT = Model accuracy statistic (Freese, 1960).
E^{*} = Five percent critical error as defined by Reynolds (1984).
GPSM = Grains per m^{2}.
LAI = Leaf area index.


Table 3. Summary Measures for Data Sets Without N Routines (regression of predicted on observed, P = a + b·O)

| Variable | Unit | n | O | P | S_o | S_p | a | b | R | d |
|---|---|---|---|---|---|---|---|---|---|---|
| Anthesis | days | 82 | 144.6 | 144.2 | 30.3 | 28.4 | 14.6 | 0.8960 | | 0.912 |
| Phys mat | days | 76 | 176.0 | 176.6 | 28.4 | 28.5 | 4.7 | 0.9763 | | 0.947 |
| Yield | kg/ha | 157 | 5,547.0 | 5,646.8 | 2,219.0 | 2,578.2 | 582.2 | 0.9166 | 0.633 | 0.88 |
| Grain wt | mg/kernel | 144 | 38.3 | 36.4 | 9.3 | 9.4 | 13.4 | 0.6234 | | 0.448 |
| GPSM | #/m^{2} | 138 | 14,594.6 | 15,161.2 | 5,115.2 | 5,811.9 | 5,507.3 | 0.6715 | | 0.376 |
| Dry matter | g/m^{2} | 76 | 1,179.2 | 1,559.1 | 525.5 | 412.2 | 791.8 | 0.5052 | | 0.411 |
| LAI | -- | 54 | 4.4 | 5.2 | 2.6 | 1.6 | 3.3 | 0.4342 | | 0.260 |

Table 4. Difference Measures for Data Sets Without N Routines

| Variable | Unit | n | MBE | MAE | Max Error | Min Error | RMSE_u |
|---|---|---|---|---|---|---|---|
| Anthesis | days | 82 | -0.4 | 5.67 | 32.0 | 0 | 8.5 |
| Phys mat | days | 76 | 0.6 | 5.1 | 18.0 | 0 | 6.6 |
| Yield | kg/ha | 157 | 99.8 | 1,272.8 | 3,468.0 | 6 | 1,552.0 |
| Grain wt | mg/kernel | 144 | -1.9 | 5.8 | 28.4 | 0 | 6.5 |
| GPSM | #/m^{2} | 138 | 566.6 | 3,444.1 | 17,446.0 | 17 | 4,437.5 |
| Dry matter | g/m^{2} | 76 | 379.9 | 352.0 | 1,324.0 | 2 | 320.0 |
| LAI | -- | 54 | 0.8 | 1.8 | 5.7 | 0 | 1.9 |

**DATA BASE**

Since model development and testing is an iterative process in the early stages, most of the data sets have been utilized during the development phase and are not truly independent. Since development of CERES-Wheat-N has necessarily lagged behind the development of CERES-Wheat, the opportunity to rigorously test the model with a large base of truly independent data sets has not yet arisen. Development and testing of the CERES-Maize-N model (Jones and Kiniry, 1986) has paralleled the work on the wheat model. Because the soil N transformation components of both models are identical, and since the basic structure of the CERES-Wheat model dictates the nature of biomass production, yield component determination, and water balance, the testing data base can be inferred as having a sufficient degree of independence from model development. Other data sets that have been used extensively in model development and calibration have not been included in the analysis.

To test the CERES-Wheat model under different growing conditions, a data base was
assembled to represent a diversity of environments, including short growing season spring
wheat crops, environments with limited water availability, sub-tropical wheat growing
areas with little vernalization, and regions with temperature extremes. The data base
represents a time span of 25 years (1959 - 1984) and a range of latitudes from 54^{o} N (West
Germany) to 36^{o} S (Australia) (see Fig. 8.1). This includes published experimental data
as well as unpublished dissertations and other unpublished sources.

Fig. 8.1. Location of experimental sites.

Because a complete minimum data set was rarely available, the additional climatic and soils information was obtained mainly through personal communication. When a few key data were unavailable, certain model inputs were estimated using the best available information and the approaches described in Chapter 5.

A good data set should contain as much detail as possible to describe the process of plant growth. However, taking a sufficient number of samplings throughout the season is time consuming and expensive and was therefore performed only in experiments especially laid out for model testing. These were particularly valuable, because their results allowed for the description of the time course of several aspects of plant growth and comparison with model output.

For a detailed description of individual data sets see Otter-Nacke, Godwin, and Ritchie (1986).

**DIFFERENCE MEASURES AND SUMMARY STATISTICS**

__Phasic Development__

The accuracy of simulating the phasic development of a crop is crucial for accurate simulation of crop growth and yield. Thus, evaluation of phasic development should be the first step. Observed and predicted means are close together with small RMSE, and similar standard deviations. While the intercept (a) is relatively close to zero and the slope (b) relatively close to unity for physiological maturity, this is not the case for anthesis date (see Tables 3 and 4).

__Grain Yield and Components__

Predicted versus observed yields are depicted in Figs. 8.2 and 8.3. Most predictions are within the limits of ±1.0 standard deviation; in only 30 out of 240 and 24 out of 157 cases was the difference between observed and predicted yield larger than 1 standard deviation. Comparing the index of agreement (d) for runs with and without account for N application summarizes the differences among these statistical parameters. The d-values amount to 0.91 and 0.88, which indicates that the model including the N routines is more accurate than runs without the N routines. This is not a thorough test, however, because the runs were performed on different data sets. CERES-Wheat tends to overpredict yield. Means and standard deviations of predicted yields, whether using or bypassing the N routines, approach those of the observations (Tables 1 and 3).

Fig. 8.2. Grain yield. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundary of ±1 standard deviation from the regression line.

Fig. 8.3 Predicted versus measured yield.

Kernel weight and grain number per m^{2} are the two components of grain yield. The model's performance in predicting these parameters was in general poorer than its simulation of grain yield. MBE indicates that the model overestimates grain weight under fertilized conditions (Table 2), whereas with the N routines switched off, the negative MBE suggests the contrary (Table 4). This is surprising, because N is assumed to be non-limiting in the latter case. The intercepts and slopes of the corresponding regressions are likely to be significantly different from zero and unity (Tables 1 and 3).

This is also obvious for the regression of predicted on observed number of kernels per m^{2} without account for N effects (Fig. 8.4). Though the regression line is closer to the 1:1 line when testing the N application data sets (Fig. 8.5), there is still considerable scatter outside the standard deviation limits. The error biases of kernel number and kernel weight alternate, suggesting that the two genetic parameters G1 and G2 (responsible for the determination of kernel number and kernel weight) are estimated poorly when made to accommodate data sets of different origin. MAE and RMSE are of similar value.

Fig. 8.4. Predicted versus measured number of grains.

Fig. 8.5. Grains/m^{2}. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundary of ±1 standard deviation from the regression line.

__Dry Matter and LAI__

More scatter around the 1:1 line occurred for biomass predictions than for grain yield (Figs. 8.6 and 8.7). Means and standard deviations of predicted and observed dry matter differ more in the data sets without account for N.

Fig. 8.6. Dry matter. Comparison of predictions of the CERES-WHEAT model with observed data from experiments.

_____ is the 1:1 line.

----- is the regression line between observed and predicted.

- - - marks the boundary of ±1 standard deviation from the regression line.

Fig. 8.7. Predicted versus measured above ground dry matter.

The MBEs are positive, indicating a tendency to overpredict dry matter production. Other coefficients show that the model simulation is acceptable over a range of total dry matter yields from 448 to 3,389 g/m^{2}. The slope of the regression line for the N trials (Fig. 8.6) supports the observation of overprediction. In the N non-limiting sets, the slope is only 0.5052, mainly due to a few outliers at the high end of the range where dry matter was strongly underpredicted.

Only summary and difference measures are given for leaf area index (LAI) (Tables 3 and 4), because the error associated with measured values of LAI is usually very high. In this particular case, maximum LAI was requested for comparison between predicted and observed values. LAI measurements, however, being time consuming, were not taken often enough to capture the peak of leaf area development. The MBE of 0.8 indicates that the model overpredicts LAI by 18 percent on average (0.8 relative to the observed mean of 4.4). Standard deviations are quite different, and intercept and slope are definitely different from zero and unity. Under the above premises, however, these statistical values are not particularly meaningful.

Total N uptake, grain N uptake, and grain protein values are of special interest when the nitrogen switch in the model is turned on. Simulation of these parameters is related to yield simulation but was not as accurate. Forty-four points out of a total of 223 fall outside the bounds of ±1.0 standard deviation of the 1:1 line for total N uptake (Fig. 8.8). For grain N uptake and grain protein percentage (Figs. 8.9 and 8.10), the counts were 31 of 215 and 60 of 215, respectively. Intercept and slope are just beyond the 5 percent confidence intervals, whereas the other statistics suggest the simulations are acceptable. Total N uptake and grain N uptake were slightly overpredicted.

Fig. 8.8. Predicted versus observed N uptake at maturity.

Fig. 8.9. Predicted versus observed grain N uptake.

Fig. 8.10. Predicted versus observed grain protein percentages.

Much of the error involved in the simulation of total N uptake was related to poor simulation of N concentrations in the straw at harvest, because grain N uptake was simulated fairly closely. Different harvesting techniques introduce potential discrepancies, e.g. when substantial amounts of chaff or leaf material are included in the sample. The model does not account for losses of N through either leaching from harvest-ripe straw or volatile losses from senescing leaves.

The scatter of points around the 1:1 line was higher for grain protein than for total and grain N uptake (Fig. 8.10). Only the modified Freese statistic indicates the model is still acceptable (Table 2). Simulation of grain protein content has been, to date, one of the most difficult components in the model. Determining whether a genetic factor controls some of the variation in grain N contents is difficult. Introduction of an additional genetic coefficient has so far been avoided.

For further details of the performance of aspects of the N routines see Godwin (1987).

**GRAIN YIELD RESPONSE TO SOWING CONDITIONS, IRRIGATION AND N APPLICATION**

__Sowing Density__

Response curves to major input variables are a pictorial method of demonstrating the model's ability to react correctly to varying input. Sowing density commonly varies with site and management system. Climatic conditions often do not allow for high densities, because stored soil water plus rainfall are not sufficient to support a large number of plants per unit area of land.

Lelystad is a site in the Netherlands where water is not a limiting factor.
Performance of crops planted at densities ranging from five to 800 plants/m^{2} was compared
focusing on the tillering process. In Fig. 8.11, observed and simulated yields in
response to increasing sowing density are given, demonstrating that even for this
environment the optimal sowing density is between 100 and 200 plants/m^{2}.

Fig. 8.11 Lelystad 1977. Observed and simulated yields in response to increasing sowing density.

The model provides a very similar response. However, the difference between observed and simulated yields is 2,238 kg/ha at the lowest sowing density of five plants/m^{2} and thus exceeds the ±1 standard deviation limit. In absolute terms this seems to be a large error; in the context of the entire experiment, however, it appears to be acceptable.

__Irrigation__

Comparisons of predicted and observed grain yields at 120 kg N/ha under different irrigation schedules are shown for three varieties at the arid wheat growing site of Tel Hadya, Syria (Fig. 8.12). An increasing response to fertilizer with increasing amounts of irrigation was observed and also simulated. Varietal differences were larger in the simulations than in reality. The model consistently overestimated the N response of variety Sonalinka and the effect of no or little irrigation on variety Mexipak.

__N Application__

Out of a number of response curves of grain yield to differing N application rates, Fig. 8.13 demonstrates the capability of the CERES-Wheat model to simulate location-specific responses to N fertilizer application. Over a range of 0 to 210 kg N applied per ha, the model provided grain yields very close to the measured yields at Rothamsted in 1975. The optimum application rate and the differences between varieties were also reflected in these response curves.

At another location, Flevopolder, The Netherlands, the model showed no response for N rates ranging from 50 to 200 kg/ha, although a 25 percent increase in yield was measured (Fig. 8.13b).

Fig. 8.12. Comparison of predicted and observed grain yields at differing fertilizer rates for three varieties with different irrigation strategies at Tel Hadya Syria, 1980.

Fig. 8.13. Comparison of predicted and observed grain yield response to applied N for individual data sets at Rothamsted, United Kingdom, 1975.

Fig. 8.13b. Comparison of predicted and observed grain yield response to applied N for individual data sets at Flevopolder, Netherlands, 1975.

The amount of fertilizer applied and the timing of split applications may result in different grain yields. For three locations, Lancelin (Western Australia), Bozeman (Montana), and Wageningen (Netherlands), simulated and observed yields are compared for five to seven different fertilizer strategies (Fig. 8.14). The response pattern was reflected well in the model simulations. In cases where simulated yields did not coincide with observations, yield was overestimated to varying degrees.

**SEASONAL PATTERN OF MODEL OUTPUT COMPARED TO OBSERVATIONS**

Tracing the seasonal pattern of different aspects of plant growth, such as different plant parts, LAI, or tiller development, and checking against real world observations is another way to locate errors in the simulation process, thus ensuring accuracy. A few examples of experimental data used for this purpose follow.

Fig. 8.14. Comparison of predicted and observed grain yield response to differing fertilizer split application patterns.

__Dry Matter__

The seasonal pattern of dry matter has been checked intensively. A representative example of a very good match, with data taken from a crop managed like a farmer's field, comes from Weihenstephan, West Germany, 1983 (Fig. 8.15). Two contrasting varieties, Caribo and Maris Mardler, the latter being more adapted to a cool, humid climate, performed well and produced matching yields. The timing and amount of biomass production were in accord with observations.

Fig. 8.15. CERES-Wheat (non-nitrogen) model validation results from individual experiments; predicted versus observed at Weihenstephan, West Germany 1983.

__Dry Matter Partitioning__

Several experiments have been conducted to evaluate dry matter partitioning between plant parts over the course of the growing season, e.g. in Nottingham, 1975. Data from this experiment were compared to model values (Fig. 8.16). LAI and root weight were overestimated by the model, resulting in slightly higher than observed total dry matter values. Partitioning plants into leaves, stems, and ears, which are sinks at different stages of the plant life cycle, illustrates which parts may not have been modelled satisfactorily. In this case, maximum predicted leaf weight was about 400 g per m^{2} higher than observed. This fully accounts for the amount of overestimated dry matter, because stem and ear weights were in perfect accord with the observations in the experiment.

A different situation was found in a 1971 experiment in Rutherglen, Australia, where measurements were taken on a biweekly basis (Fig. 8.17). Ear weight was simulated well, but stem weights were missed considerably. This could be related to a different definition of plant parts, such as leaf sheaths being included in either the stem weight or the leaf weight. The amount of assimilate partitioned to the roots was obviously too high over the whole growing season.

__Leaf Area Index (LAI)__

Fig. 8.18 compares model-produced LAI values with observed LAI in an experiment in Roodeplaat, South Africa, where varying amounts of irrigation water were applied at different times. Plots receiving higher amounts of water (irrigations 3 and 5) produced denser canopies, which could be simulated well. In irrigations 2 and 4, the timing of LAI build-up was correct, but LAI was small. Irrigation 1 resulted in a different modelled leaf area development, where the timing was later than observed in the field and maximum LAI was close to 4 rather than the 2.5 measured in the experiment.

Fig. 8.16. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Nottingham, England, 1975; predicted versus observed.

Fig. 8.16. Continued.

Fig. 8.17. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Rutherglen, Australia, 1971; predicted versus observed.

Fig. 8.17. Continued.

Fig. 8.18. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Roodeplaat, South Africa, 1978; predicted versus observed.

Fig. 8.18. Continued.

__Tiller Development__

Modeling tiller development satisfactorily has been one of the more difficult tasks, because this process is controlled by a set of environmental, inter-plant, and intra-plant conditions. The effect of sowing density on tiller development under arid conditions was monitored in a 1984 experiment in Temple, Texas (Fig. 8.19). The normal shape of a tillering curve was observed at the common sowing density of 160 plants/m^{2}, and the model was able to simulate this curve almost perfectly. At 40 plants/m^{2} the model missed the peak of tiller development observed in the field at ca. 1,200 tillers/m^{2}. At the lowest sowing density of 10 plants/m^{2}, the model maintained the number of tillers formed early in the season until harvest, whereas some degree of tiller abortion was observed in the field.

The first measurement in the highest density plot was taken about 90 days after sowing when the tiller number for this dense crop was at maximum and some plants had up to 40 tillers. Some error could be associated with the sampling technique so that missing this data point is not considered to be crucial. For the rest of the season there was good agreement between modelled and observed tillers with the final number of tillers being correct.

Fig. 8.19. CERES-Wheat (non-nitrogen) model validation results from individual experiments at Temple, Texas, 1984; predicted versus observed.

**CONCLUSIONS**

The CERES-Wheat model is designed to be used as a management-oriented simulation model for a diversity of applications in all environments suitable for wheat growing. To make the model useful to as wide an audience as possible, the inputs must be minimal, and they must be reasonably easy to obtain or estimate from standard agricultural practice. Under these premises, the model simulates crop growth and development and the responses to N fertilizer, irrigation, sowing time, and sowing density reasonably reliably. Rigorous statistical analysis indicated that the model simulations are acceptable. However, some problems have yet to be resolved with the prediction of tiller development, N uptake, and grain protein concentration, as suggested by comparison with a few data sets. A slight tendency to overpredict biomass and N uptake early in the season was noted. Further data sets are required to test the soil N components of the model more rigorously. Additional testing and refinement of the indicated parts of the model may be beneficial.