This article describes techniques for determining the best prediction model using the skill score method.
A Climate Seasonal Forecast Assessment refers to the systematic evaluation of seasonal climate forecasts, which typically predict weather patterns over several months. This process involves analyzing the accuracy, skill, and reliability of these forecasts to determine how closely they match the actual weather outcomes (observation). The primary goal of this assessment is to verify whether the forecasted seasonal conditions, such as temperature, precipitation, and wind patterns, align with the observed climate data for the given period.
The verification aspect plays a crucial role in this assessment, as it measures the forecast’s skill and consistency. Verification involves comparing the predicted values with real-world data, often using statistical techniques to determine the forecast’s accuracy. The results of the verification help meteorologists refine forecasting models, identify areas for improvement, and better understand the underlying climatic variables that influence seasonal patterns. Validation is the confirmation, through testing and the provision of objective evidence, that a model meets the requirements for a specific purpose. Verification, on the other hand, is the process of comparing model calculations (forecasts) with actual values (observations), and is generally equated with validation (Jolliffe and Stephenson 2011).
The type of verification is adapted to the type of forecast; here are some examples of forecast types and the corresponding verification methods (WWRP 2017):
Nature of forecast: | Example(s): | Verification methods: |
---|---|---|
deterministic (non-probabilistic) | quantitative precipitation forecast | visual, dichotomous, multi-category, continuous, spatial |
probabilistic | probability of precipitation, ensemble forecast | visual, probabilistic, ensemble |
qualitative (worded) | 5-day outlook | visual, dichotomous, multi-category |
Space-time domain: | ||
time series | daily maximum temperature forecasts for a city | visual, dichotomous, multi-category, continuous, probabilistic |
spatial distribution | map of geopotential height, rainfall chart | visual, dichotomous, multi-category, continuous, probabilistic, spatial, ensemble |
pooled space and time | monthly average global temperature anomaly | dichotomous, multi-category, continuous, probabilistic, ensemble |
Specificity of forecast: | ||
dichotomous (yes/no) | occurrence of fog | visual, dichotomous, probabilistic, spatial, ensemble |
multi-category | cold, normal, or warm conditions | visual, multi-category, probabilistic, spatial, ensemble |
continuous | maximum temperature | visual, continuous, probabilistic, spatial, ensemble |
object- or event-oriented | tropical cyclone motion and intensity | visual, dichotomous, multi-category, continuous, probabilistic, spatial |
Allan Murphy (Murphy 1993), a pioneer in the field of forecast verification, distinguishes three types of ‘goodness’ of a forecast: consistency (agreement between the forecast and the forecaster’s judgement), quality (agreement between the forecast and the observations), and value (the benefit the forecast provides to its users).
In addition, Murphy also mentioned that there are 8 aspects (attributes) that show the quality of a forecast, including:
No. | Aspects/Attributes | Description | Scores/metrics that can be used |
---|---|---|---|
1 | Bias | Difference between predicted and observed means | Mean Error (ME), Mean Square Error (MSE), Scatter plot |
2 | Accuracy | The degree of agreement between prediction and observation | Continuous Rank Probability Score (CRPS), Taylor Diagram |
3 | Uncertainty | The variability of the observed values; the greater the uncertainty in the observations, the more difficult the event is to predict | Verification Rank Histogram (VRH) |
4 | Sharpness | The ability of the forecast to predict extreme values. Sharpness is “only” possessed by predictions (not observations); even poor predictions still have the attribute of sharpness | Ensemble Spread (SPRD), Spread Skill Relationship (SSR) |
5 | Resolution | The ability of the forecast to distinguish situations whose outcomes differ from the climatological probability of the event, and whether the prediction system predicts these differences correctly; often measured through the mean squared probabilistic prediction error | Brier Score (BS), Relative Operating Characteristic (ROC) |
6 | Discrimination | The forecast’s ability to clearly distinguish a situation that leads to the occurrence or non-occurrence of an event | Relative Operating Characteristic (ROC) |
7 | Reliability | Statistical consistency between the probabilistic prediction of an event and the actual frequency of occurrence | Reliability Diagram (RD), Brier Score (BS) |
8 | Skill | The accuracy of a prediction model relative to a reference (e.g. climatology), or the improvement in accuracy gained from an improved prediction system; skill measures the superiority of a prediction system over a baseline of past observations | Skill Score (SS): BSS, CRPSS, ROCSS |
Some meteorological agencies in other countries have evaluated their predictions using several verification metrics, for example:
\(~\)
There are several approaches that are commonly used to assess seasonal climate forecasts (Wilks 2011; Jolliffe and Stephenson 2011; WWRP 2017), such as:
No. | Metric | Formulation | Description |
---|---|---|---|
1 | Mean Error (ME) | \(\text{ME}=\frac{\sum_{i=1}^{n}\left(f_i-o_i\right)}{n}\) | \(n=\) Number of data pairs (forecast & observation); \(f_i=\) Forecast value; \(o_i=\) Observation value; Best Score \(=0\); Verification method = continuous; |
2 | Mean Absolute Error (MAE) | \(\text{MAE}=\frac{\sum_{i=1}^{n}\left|f_i-o_i\right|}{n}\) | \(n=\) Number of data pairs (forecast & observation); \(f_i=\) Forecast value; \(o_i=\) Observation value; Best Score \(=0\); Verification method = continuous; |
3 | Root Mean Square Error (RMSE) | \(\text{RMSE}=\sqrt{\frac{\sum_{i=1}^{n} (f_i - o_i)^2}{n}}\) | \(n=\) Number of data pairs (forecast & observation); \(f_i=\) Forecast value; \(o_i=\) Observation value; Best Score \(=0\); Verification method = continuous; |
4 | Boxplot | (boxplot illustration) | \(Q_1=(n+1)*0.25\); \(Q_2=(n+1)*0.5\); \(Q_3=(n+1)*0.75\); \(IQR=Q_3-Q_1\); Best Score = Closest to Observation; Verification method = continuous, visual; |
5 | Scatter plot | (scatter plot illustration) | Best Score = points gather around the diagonal; Verification method = continuous, visual; |
6 | Correlation Coefficient \((r)\) | \(r=\frac{\sum{(f_i-\overline{f})(o_i-\overline{o})}}{\sqrt{\sum{(f_i-\overline{f})^2}}\sqrt{\sum{(o_i-\overline{o})^2}}}\) | \(f_i=\) i-th forecast value; \(\overline{f}=\) mean of forecast values; \(o_i=\) i-th observation value; \(\overline{o}=\) mean of observation values; Best Score = 1; Verification method = continuous; |
7 | Dichotomous Contingency Table | (2×2 contingency table: \(a\) = hits, \(b\) = false alarms, \(c\) = misses, \(d\) = correct negatives, \(n = a+b+c+d\)) | \(\text{HR} = \frac{a}{a + c} = \hat{p}(\hat{x} = 1 \mid x = 1)\); \(\text{PC or Accuracy} = \frac{a + d}{n} = \hat{p}\left[(\hat{x} = 1, x = 1) \, \text{or} \, (\hat{x} = 0, x = 0)\right]\); \(\text{FAR} = \frac{b}{a + b} = \hat{p}(x = 0 \mid \hat{x} = 1)\); \(\text{CSI} = \frac{a}{a + b + c}\); ROC |
8 | Multi-Category Contingency Table | (\(K \times K\) contingency table of forecast vs. observed categories) | \(\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{K}n(F_i, O_i)\); \(\text{HSS} = \frac{\frac{1}{N} \sum_{i=1}^{K} n(F_i, O_i)-\frac{1}{N^2} \sum_{i=1}^{K} N(F_i) N(O_i)}{1 - \frac{1}{N^2} \sum_{i=1}^{K} N(F_i) N(O_i)}\); Best Score Accuracy = 1; Best Score HSS = 1 |
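To make the formulas above concrete, here is a minimal base-R sketch of the continuous scores and the dichotomous contingency-table scores; the `fcst` and `obs` vectors and the 50 mm event threshold are hypothetical illustration values, not data from this study.

```r
# Hypothetical paired forecast and observation values (illustration only)
fcst <- c(12, 30, 55, 0, 8, 100)
obs  <- c(10, 42, 60, 5, 0, 85)

me   <- mean(fcst - obs)            # Mean Error (bias); best score = 0
mae  <- mean(abs(fcst - obs))       # Mean Absolute Error; best score = 0
rmse <- sqrt(mean((fcst - obs)^2))  # Root Mean Square Error; best score = 0
r    <- cor(fcst, obs)              # Correlation coefficient; best score = 1

# Dichotomous scores for a yes/no event, here "rainfall >= 50 mm" (assumed threshold)
event_f <- fcst >= 50
event_o <- obs  >= 50
hits        <- sum( event_f &  event_o)   # a
false_alarm <- sum( event_f & !event_o)   # b
misses      <- sum(!event_f &  event_o)   # c
corr_neg    <- sum(!event_f & !event_o)   # d
HR  <- hits / (hits + misses)               # Hit Rate
PC  <- (hits + corr_neg) / length(obs)      # Proportion Correct (accuracy)
FAR <- false_alarm / (hits + false_alarm)   # False Alarm Ratio
CSI <- hits / (hits + false_alarm + misses) # Critical Success Index
```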
BMKG’s prediction of the onset of the Indonesian wet and dry seasons is based on rainfall values obtained from the output of several models. These models comprise dynamical and statistical models, each with its own strengths and weaknesses. The number of models used inevitably adds uncertainty to the prediction of season onset. To address this, several approaches can be taken, for example selecting the single best model on the basis of a skill score, or combining all model outputs using skill-score-based weights; both are demonstrated below.
Some of the metrics described above can be used to evaluate the rainfall output of the models used in seasonal prediction. Here we use an evaluation metric called the Taylor Skill Score (TSS) (Xiaoli Yang and Sheffield 2020), commonly called the Skill Score (SS), which is based on the Taylor Diagram. The Taylor Diagram displays three quantities (Correlation Coefficient, RMSE, and \({NSD}_m\)) in a single graph; judging each of these outputs separately makes it difficult to determine the best model. The TSS combines them into a single value that expresses the skill of the rainfall prediction model being evaluated.
\[\begin{equation} \tag{1} SS = \frac{4(1 + CC)^4}{\left( \text{NSD}_m + \frac{1}{\text{NSD}_m} \right)^2 (1 + CC_0)^4} \end{equation}\] \[\begin{equation} \tag{2} \text{NSD}_m = \frac{\sigma_{\text{mod}}}{\sigma_{\text{obs}}} \end{equation}\]
Where: \[ \text{NSD}_m = \text{Normalized standard deviation of the simulation (model)} \]
\[ \sigma_{\text{mod}} = \text{Standard deviation of the simulation (model)} \]
\[ \sigma_{\text{obs}} = \text{Standard deviation of the observations} \]
\[ m = \text{subscript denoting the model (simulation)} \]
\[ \text{CC}_0 = \text{Maximum attainable correlation coefficient (usually taken as 1)} \]
\[ \text{CC} = \text{Correlation coefficient between the simulated (model) and observed data} \]
The closer the SS is to 1, the better the ability of the individual model to represent the observations.
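As an illustration, equations (1) and (2) can be written as a small R helper; `taylor_ss` is a hypothetical function name used only in this sketch, with `mod` and `obs` as paired numeric vectors of model and observed values.

```r
# Sketch of eq. (1) and (2); `taylor_ss` is a hypothetical helper name
taylor_ss <- function(mod, obs, cc0 = 1) {
  cc   <- cor(mod, obs)        # CC: correlation between model and observations
  nsdm <- sd(mod) / sd(obs)    # NSDm: normalized standard deviation, eq. (2)
  4 * (1 + cc)^4 / ((nsdm + 1 / nsdm)^2 * (1 + cc0)^4)  # SS, eq. (1)
}
```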
\(~\)
Here is an example of rainfall forecast data from several models (initialized in January 2011) together with observations, covering 21 dasarian (10-day periods) for ZOM Aceh_01:
CODE TO CREATE A TAYLOR DIAGRAM
```r
library(openair)   # TaylorDiagram()
library(Metrics)   # rmse()
library(tidyr)     # pivot_longer()
#data <- read.csv('path/to/your/data.csv')
# Reshape to long format: one row per (dekad, model), keeping 'Jan_2011' and 'OBS'
t_dt3 <- pivot_longer(data = data,
                      cols = colnames(data[!colnames(data) %in% c('Jan_2011', 'OBS')]),
                      names_to = 'MODEL', values_to = 'CH')
t_dt3$MODEL <- factor(t_dt3$MODEL, levels = unique(t_dt3$MODEL))
# Normalised Taylor Diagram comparing each model (CH) against the observations (OBS)
p2 <- TaylorDiagram(t_dt3, obs = "OBS", mod = "CH", group = "MODEL", normalise = TRUE,
                    cols = c('cornflowerblue', 'blue', 'yellow2', 'deeppink',
                             'deeppink4', 'orange1', 'orange4'),
                    text.obs = 'Observation', key = TRUE, main = 'Taylor Diagram')
```
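The summary table below (RMSE, CC, NSDm, SS, and the weight W) can be derived along the following lines. This is a sketch that assumes `data` is the data frame loaded above (one rainfall column per model plus `OBS` and the `Jan_2011` dekad column) and reuses the hypothetical `taylor_ss()` helper defined earlier.

```r
# Per-model verification summary (sketch)
models <- setdiff(colnames(data), c('Jan_2011', 'OBS'))
summary_tab <- sapply(models, function(m) {
  c(RMSE = Metrics::rmse(data$OBS, data[[m]]),   # Metrics::rmse(actual, predicted)
    CC   = cor(data[[m]], data$OBS),
    NSDm = sd(data[[m]]) / sd(data$OBS),
    SS   = taylor_ss(data[[m]], data$OBS))
})
# Weight of each model relative to the others (see eq. (3) further below)
summary_tab <- rbind(summary_tab, W = summary_tab['SS', ] / sum(summary_tab['SS', ]))
round(summary_tab, 7)
```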
Metric | ECMWF_RAW | ECMWF_COR | MME1 | CFSv2_RAW | CFSv2_COR | ARIMA | WARIMA |
---|---|---|---|---|---|---|---|
RMSE | 42.9393894 | 32.8570424 | 38.5247122 | 46.4539831 | 31.4249754 | 43.5126742 | 42.0122118 |
CC | 0.6972082 | 0.7312276 | 0.5420013 | 0.4109492 | 0.7422360 | 0.3165167 | 0.4026214 |
NSDm | 0.6112191 | 0.5841298 | 0.6217545 | 0.6297243 | 0.7395452 | 0.3083050 | 0.3535605 |
SS | 0.4107342 | 0.4259743 | 0.2842024 | 0.2014521 | 0.5264507 | 0.0595302 | 0.0955697 |
W | 0.2049660 | 0.2125712 | 0.1418237 | 0.1005293 | 0.2627113 | 0.0297070 | 0.0476915 |
Based on the example above, the best model for ZOM ACEH_01 at the January 2011 initialization is CFSv2_COR. The SS calculation should be repeated over the whole available data period (in this example 2011 - 2020) and for every initialization month (January - December), so that an SS is obtained for each month of each of the 10 years. From all of these values, the 10-year average SS can be computed and the model with the highest average SS selected as the best model.
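A sketch of that averaging step, assuming `ss_aceh01` is the data frame used in the boxplot code further below (metadata in the first two columns and one SS column per model, one row per year; the exact layout is an assumption):

```r
# Average SS per model over all years for one ZOM and one initialization month
mean_ss    <- colMeans(ss_aceh01[, 3:ncol(ss_aceh01)], na.rm = TRUE)
best_model <- names(which.max(mean_ss))   # model with the highest 10-year average SS
```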
BEST MODEL FOR EACH ZOM BASED ON AVERAGE SS 2011-2020
Based on the highest Skill Score (maximum 1), we obtain the best model in each ZOM for each initialization. From the tabular results above, we can display the Skill Score and the best model for each ZOM spatially.
BEST MODEL FOR EACH ZOM BASED ON AVERAGE SS 2011-2020 ON INITIAL JANUARY
Then, we can also examine the distribution of Skill Scores over the 10 years (2011 - 2020) using boxplots, to see where the skill scores are concentrated.
EXAMPLE OF SKILL SCORE DISTRIBUTION IN SEVERAL ZOMS IN A GIVEN MONTH (2011-2020)
SS FOR 2011-2020 AT ZOM ACEH_01
CODE TO CREATE A BOXPLOT
```r
#ss_aceh01 <- read.csv('path/to/your/data.csv')
plot.new()
# Boxplots of the yearly skill scores for each model (SS values start in column 3)
boxplot(x = ss_aceh01[, 3:ncol(ss_aceh01)], lwd = 2, yaxt = 'n', ylim = c(0, 1),
        col = 'white', medlwd = 2, cex.axis = 0.5)
# Mark the mean SS of each model with a blue circle
points(x = 1:ncol(ss_aceh01[, 3:ncol(ss_aceh01)]),
       apply(X = ss_aceh01[, 3:ncol(ss_aceh01)], MARGIN = 2, FUN = mean),
       pch = 21, col = 'blue3', lwd = 2)
# Overlay the individual yearly SS values as jittered red dots
for (i in 1:(ncol(ss_aceh01) - 2)) {
  points(jitter(rep(i, nrow(ss_aceh01))), ss_aceh01[[i + 2]], col = "red", pch = 16, cex = 0.45)
}
# Annotate each box with its mean SS
text(x = 1:ncol(ss_aceh01[, 3:ncol(ss_aceh01)]),
     y = apply(X = ss_aceh01[, 3:ncol(ss_aceh01)], MARGIN = 2, FUN = mean) + 0.02,
     labels = paste0('Mean = ', round(apply(X = ss_aceh01[, 3:ncol(ss_aceh01)], MARGIN = 2, FUN = mean), 2)),
     cex = 0.5)
axis(side = 2, at = seq(0, 1, 0.1), labels = seq(0, 1, 0.1), cex.axis = 0.8)
# Dashed background gridlines
segments(x0 = seq(0, 7, 0.5), y0 = 0, y1 = 1, lty = 2, lwd = 0.5, col = 'lightgrey')
segments(x0 = 0, x1 = 7, y0 = seq(0, 1, 0.1), lty = 2, lwd = 0.5, col = 'lightgrey')
mtext(~ italic('Skill Score (SS)'), side = 2, line = 2, cex = 0.8)
title(main = 'Boxplot Skill Score ZOM : ACEH_01', line = 0.5)
mtext(~ italic('Period 2011 - 2020'), side = 1, line = par('usr')[3] - 1, col = "black", cex = 0.6, at = par('usr')[2] - 0.7)
mtext('Init : JAN', side = 3, font = 3, line = par('usr')[3] - 1, col = "darkblue", cex = 0.8, at = par('usr')[2] - 0.4)
```
From the highest Skill Score results, we know the best model for each initialization in each ZOM. For season-onset prediction, we can use the best model corresponding to the initialization month of the seasonal rainfall prediction; the season onset is then determined from the rainfall values predicted by this best model.
Another option, if we still want to use all of the seasonal rainfall model outputs, is to weight each model output. The weights can be derived from the Skill Score of each model calculated earlier. Mathematically, this can be formulated as follows:
\[\begin{equation} \tag{3} W_i = \frac{SS_i}{\sum_{j=1}^n SS_j} \end{equation}\]
\[\begin{equation} \tag{4} \sum_{i=1}^n W_i = 1 \end{equation}\]
Where: \[ SS_i = \text{$i$-th model Skill Score, where } i = 1, 2, \dots, n \]
\[ W_i = \text{$i$-th model Weight, where } i = 1, 2, \dots, n \]
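A minimal sketch of equations (3) and (4) in R; the `ss` values here are placeholders rather than the exact BALI_11 values tabulated below (whose weights may have been derived from the yearly skill scores rather than from the displayed averages).

```r
# Skill-score-based weights: each weight is proportional to the model's SS
ss <- c(MME1 = 0.48, ECMWF_RAW = 0.43, ECMWF_COR = 0.54, CFSv2_RAW = 0.07,
        CFSv2_COR = 0.51, ARIMA = 0.54, WARIMA = 0.47)   # hypothetical SS values
w  <- ss / sum(ss)   # eq. (3)
sum(w)               # = 1, eq. (4)
```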
As practice, we will do the weight calculation for ZOM BALI_11 with the February initialization. The resulting weights (W) sum to 1.
ZOM | TYPE | INIT | MME1 | ECMWF_RAW | ECMWF_COR | CFSv2_RAW | CFSv2_COR | ARIMA | WARIMA |
---|---|---|---|---|---|---|---|---|---|
BALI_11 | SS | Feb | 0.4786296 | 0.4334377 | 0.5389604 | 0.0695950 | 0.5097606 | 0.5386813 | 0.4729669 |
BALI_11 | W | Feb | 0.1543724 | 0.1430077 | 0.1771860 | 0.0249932 | 0.1750009 | 0.1737069 | 0.1517329 |
Next, we weight the seasonal rainfall prediction initialized in February 2024 (generally used for dry-season prediction), multiplying each model’s output by the weight (W) obtained in the previous calculation; a code sketch of this step follows the data table below.
SEASONAL RAINFALL PREDICTION DATA INITIAL FEBRUARY 2024
DAS | MME1 | ECMWF_RAW | ECMWF_COR | CFSv2_RAW | CFSv2_COR | ARIMA | WARIMA |
---|---|---|---|---|---|---|---|
Feb I | 163.8225 | 96.9621 | 131.9956 | 55.9008 | 202.7789 | 107.4265 | 92.3780 |
Feb II | 72.1478 | 83.7558 | 114.0584 | 29.5072 | 107.0710 | 99.2311 | 92.3780 |
Feb III | 81.4403 | 75.8253 | 103.3301 | 30.6811 | 111.3555 | 87.2150 | 85.6466 |
Mar I | 25.0803 | 97.1974 | 107.0056 | 49.4504 | 151.6484 | 75.1727 | 84.5403 |
Mar II | 120.4728 | 93.7277 | 103.1808 | 75.5555 | 231.6704 | 74.0638 | 80.1041 |
Mar III | 169.5966 | 85.7359 | 94.4199 | 83.6646 | 256.5225 | 63.1690 | 44.2504 |
Apr I | 29.1713 | 74.2389 | 60.4546 | 68.8359 | 74.1464 | 50.7099 | 40.5221 |
Apr II | 38.4456 | 63.8937 | 52.0237 | 70.8249 | 76.2927 | 50.6072 | 8.5760 |
Apr III | 60.3156 | 55.0750 | 44.8474 | 75.9045 | 81.7458 | 38.5234 | 38.8828 |
May I | 23.8206 | 44.4667 | 28.0870 | 72.5172 | 34.3506 | 28.6324 | 10.1355 |
May II | 31.2781 | 39.5156 | 24.9615 | 66.9831 | 31.7315 | 24.5754 | 24.9866 |
May III | 32.9556 | 29.2346 | 18.4672 | 123.3038 | 58.4087 | 28.1022 | 7.3104 |
Jun I | 19.5028 | 24.1109 | 15.9519 | 60.1589 | 27.3728 | 13.4558 | 7.2317 |
Jun II | 8.7416 | 22.2295 | 14.7116 | 39.5339 | 17.9883 | 12.1809 | 0.0000 |
Jun III | 5.8922 | 14.4046 | 9.5367 | 36.8783 | 16.7778 | 12.9941 | 2.9549 |
Jul I | 10.4091 | 9.8209 | 9.3945 | 32.3297 | 22.2831 | 8.1527 | 2.2763 |
Jul II | 15.2450 | 10.0609 | 9.6385 | 35.1482 | 24.2216 | 5.2905 | 14.9037 |
Jul III | 19.0328 | 10.5429 | 10.0975 | 27.4801 | 18.9454 | 0.0000 | 8.1212 |
Aug I | 1.3834 | 6.5196 | 3.7593 | 22.5320 | 8.8758 | 14.0602 | 13.5241 |
Aug II | 3.0947 | 6.9658 | 4.0088 | 17.2212 | 6.7787 | 12.4973 | 10.8369 |
Aug III | 7.1116 | 6.9975 | 4.0227 | 25.4537 | 9.9893 | 12.2050 | 3.6793 |
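The weighting itself (each model column multiplied by its weight, then summed per dasarian) could be coded along these lines; `pred` (the prediction table above) and the named weight vector `w` are assumed names for this sketch.

```r
# Multiply each model column by its weight and sum per dasarian (RRw)
model_cols <- names(w)                                          # model column order
weighted   <- sweep(pred[, model_cols], 2, w[model_cols], `*`)  # weight each column
pred_w     <- cbind(DAS = pred$DAS, weighted, RRw = rowSums(weighted))
```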
WEIGHTED FOR INITIAL FEBRUARY 2024
DAS | MME1 | ECMWF_RAW | ECMWF_COR | CFSv2_RAW | CFSv2_COR | ARIMA | WARIMA | RRw |
---|---|---|---|---|---|---|---|---|
Feb I | 25.2896762 | 13.8663244 | 23.3877699 | 1.3971397 | 35.486489 | 18.6607283 | 14.0167813 | 132.104908 |
Feb II | 11.1376307 | 11.9777222 | 20.2095495 | 0.7374793 | 18.737521 | 17.2371305 | 14.0167813 | 94.053814 |
Feb III | 12.5721364 | 10.8435998 | 18.3086451 | 0.7668188 | 19.487312 | 15.1498505 | 12.9954065 | 90.123769 |
Mar I | 3.8717067 | 13.8999741 | 18.9598922 | 1.2359236 | 26.538606 | 13.0580195 | 12.8275444 | 90.391666 |
Mar II | 18.5976780 | 13.4037804 | 18.2821913 | 1.8883735 | 40.542527 | 12.8653959 | 12.1544269 | 117.734373 |
Mar III | 26.1810380 | 12.2608917 | 16.7298826 | 2.0910458 | 44.891667 | 10.9728935 | 6.7142412 | 119.841660 |
Apr I | 4.5032442 | 10.6167324 | 10.7117076 | 1.7204292 | 12.975686 | 8.8086614 | 6.1485355 | 55.484997 |
Apr II | 5.9349404 | 9.1372895 | 9.2178703 | 1.7701407 | 13.351291 | 8.7908217 | 1.3012613 | 49.503615 |
Apr III | 9.3110653 | 7.8761477 | 7.9463306 | 1.8970961 | 14.305588 | 6.6917818 | 5.8997998 | 53.927809 |
May I | 3.6772437 | 6.3590794 | 4.9766226 | 1.8124367 | 6.011386 | 4.9736465 | 1.5378887 | 29.348303 |
May II | 4.8284761 | 5.6510341 | 4.4228279 | 1.6741218 | 5.553041 | 4.2689175 | 3.7912891 | 30.189707 |
May III | 5.0874358 | 4.1807722 | 3.2721289 | 3.0817562 | 10.221575 | 4.8815471 | 1.1092281 | 31.834443 |
Jun I | 3.0106945 | 3.4480437 | 2.8264530 | 1.5035632 | 4.790265 | 2.3373658 | 1.0972868 | 19.013672 |
Jun II | 1.3494620 | 3.1789891 | 2.6066893 | 0.9880786 | 3.147969 | 2.1159068 | 0.0000000 | 13.387094 |
Jun III | 0.9095932 | 2.0599683 | 1.6897695 | 0.9217066 | 2.936130 | 2.2571653 | 0.4483555 | 11.222689 |
Jul I | 1.6068780 | 1.4044641 | 1.6645737 | 0.8080226 | 3.899562 | 1.4161805 | 0.3453896 | 11.145071 |
Jul II | 2.3534076 | 1.4387859 | 1.7078071 | 0.8784659 | 4.238802 | 0.9189966 | 2.2613815 | 13.797646 |
Jul III | 2.9381394 | 1.5077156 | 1.7891354 | 0.6868156 | 3.315462 | 0.0000000 | 1.2322532 | 11.469521 |
Aug I | 0.2135588 | 0.9323528 | 0.6660953 | 0.5631467 | 1.553273 | 2.4423543 | 2.0520508 | 8.422832 |
Aug II | 0.4777363 | 0.9961629 | 0.7103032 | 0.4304128 | 1.186279 | 2.1708677 | 1.6443142 | 7.616076 |
Aug III | 1.0978349 | 1.0006962 | 0.7127660 | 0.6361693 | 1.748136 | 2.1200932 | 0.5582708 | 7.873967 |
The seasonal rainfall prediction value that we will use is the sum of the weighted model values in each dasarian, shown in the RRw column of the table above.