Retrospective Evaluation of the Five-Year and Ten-Year CSEP-Italy Earthquake Forecasts

On 1 August 2009, the global Collaboratory for the Study of Earthquake Predictability (CSEP) launched a prospective and comparative earthquake predictability experiment in Italy. The goal of the CSEP-Italy experiment is to test earthquake occurrence hypotheses that have been formalized as probabilistic earthquake forecasts over temporal scales that range from days to years. In the first round of forecast submissions, members of the CSEP-Italy Working Group presented eighteen five-year and ten-year earthquake forecasts to the European CSEP Testing Center at ETH Zurich. We considered the twelve time-independent earthquake forecasts among this set and evaluated them with respect to past seismicity data from two Italian earthquake catalogs. In this article, we present the results of tests that measure the consistency of the forecasts with the past observations. Besides being an evaluation of the submitted time-independent forecasts, this exercise provided insight into a number of important issues in predictability experiments with regard to the specification of the forecasts, the performance of the tests, and the trade-off between the robustness of results and experiment duration. We conclude with suggestions for the future design of earthquake predictability experiments.

where the epicenters are smoothed using a fixed lengthscale σ; σ is optimized by minimizing the average area skill score misfit function in a retrospective experiment [Zechar and Jordan, 2010a]. The density map is scaled to match the average historical rate of seismicity. The two forecasts Zechar-Jordan.CPTI and Zechar-Jordan.CSI were optimized for two different catalogs, while Zechar-Jordan.Hybrid is a hybrid forecast.

3 Specification of CSEP-Italy Forecasts

We use the term "seismicity model" to mean a system of hypotheses and inferences that is presented as a mathematical, numerical and simplified description of the process of seismicity. A "seismicity forecast" is a statement about some observable aspect of seismicity that derives from a seismicity model. In the context of the CSEP-Italy experiment, a seismicity forecast is a set of estimates of the expected number [...]

[...] we focused mainly on the data since the 1950s because it seems to be of higher quality [Schorlemmer et al., 2010a]. We divided the period into non-overlapping ten-year periods to mimic the duration of [...] within the testing region. Some quakes, mostly during the early part of the CPTI catalog, were not assigned depths. We included these earthquakes as observations within the testing region because it is very unlikely that they were deeper than 30 km (see also Schorlemmer et al. [2010a]).

The Poisson distribution is defined by its discrete probability mass function

P(n) = (λ^n / n!) e^(−λ),

where n = 0, 1, 2, ..., and λ is the rate parameter, the only parameter needed to define the distribution. The expected value and the variance σ²_POI of the Poisson distribution are both equal to λ.
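As a quick numerical check, the equality of the mean and variance can be verified directly from the mass function. The sketch below is purely illustrative (it is not part of the CSEP-Italy analysis) and truncates the sum at n = 100, where the remaining tail mass is negligible for the rate used:

```python
import math

def poisson_pmf(n, lam):
    """Poisson probability mass function: P(n) = lam**n * exp(-lam) / n!"""
    return lam ** n * math.exp(-lam) / math.factorial(n)

# Truncated-sum check that the mean and variance both equal lam.
lam = 8.55  # illustrative rate, chosen to match the five-year sample mean below
mean = sum(n * poisson_pmf(n, lam) for n in range(100))
var = sum((n - mean) ** 2 * poisson_pmf(n, lam) for n in range(100))
# mean ≈ var ≈ lam
```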

Because the span of time over which the CSI catalog is reliable is so short, we used the CPTI catalog for the seismicity rate analysis. The sample variance of the distribution of the number of observed earthquakes in the CPTI catalog over non-overlapping five-year periods from 1907 until 2006 (inclusive) is σ²_5yr = 23.73, while the sample mean is µ_5yr = 8.55. For non-overlapping ten-year periods of the CPTI catalog, the sample variance is σ²_10yr = 64.54, while the sample mean is µ_10yr = 17.10. Because the sample variance is so much larger than the sample mean on the five- and ten-year timescales, it is clear that the seismicity rate varies more widely than expected by a Poisson distribution.
Figure 1 shows the number of observed earthquakes in each of the twenty non-overlapping five-year intervals, along with the empirical cumulative distribution function. The Poisson distribution with λ = µ_5yr = 8.55 and its 95% confidence bounds are also shown. One should expect that one in twenty data points falls outside the 95% confidence interval, but we observe four, and one of these lies outside the 99.99% quantile.
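This kind of overdispersion check is straightforward to reproduce. The sketch below uses a hypothetical list of twenty five-year counts (our own illustration; the actual counts are those plotted in Figure 1) and inverts the Poisson CDF by accumulating mass-function terms:

```python
import math
from statistics import mean, variance

# Hypothetical five-year earthquake counts (illustration only, not the CPTI data).
counts = [5, 12, 8, 3, 19, 7, 9, 6, 14, 4, 10, 8, 2, 11, 9, 7, 13, 6, 8, 10]

mu = mean(counts)        # sample mean
s2 = variance(counts)    # unbiased sample variance
dispersion = s2 / mu     # a ratio well above 1 indicates overdispersion

def poisson_quantile(q, lam):
    """Smallest n whose Poisson CDF reaches q, by accumulating pmf terms."""
    n, cdf, pmf = 0, 0.0, math.exp(-lam)
    while cdf + pmf < q:
        cdf += pmf
        n += 1
        pmf *= lam / n
    return n

# Two-sided 95% bounds of a Poisson distribution with rate = sample mean,
# and the number of observations falling outside them.
lo, hi = poisson_quantile(0.025, mu), poisson_quantile(0.975, mu)
outside = sum(1 for c in counts if c < lo or c > hi)
```

For these illustrative counts the dispersion ratio is well above one, and more points fall outside the Poisson bounds than the expected one in twenty.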

We compared the goodness of fit of the Poisson distribution with that of a negative binomial distribution (NBD), motivated by studies that suggest its use based on empirical and theoretical considerations [Vere-Jones, 1970; Kagan, 1973; Jackson and Kagan, 1999; Kagan, 2010; Schorlemmer et al., 2010b; Werner et al., 2010a]. The discrete negative binomial probability mass function is

P(n) = [Γ(τ + n) / (Γ(τ) n!)] ν^τ (1 − ν)^n,

where n = 0, 1, 2, ..., Γ is the gamma function, 0 ≤ ν ≤ 1, and τ > 0. The average of the NBD is given by

µ_NBD = τ(1 − ν)/ν,

while the variance is given by

σ²_NBD = τ(1 − ν)/ν².

Figure 1 shows the 95% confidence bounds of the fitted NBD in the number of observed events (left panel), and the NBD cumulative distribution function (right panel).
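Inverting the mean and variance relations gives a method-of-moments fit: ν = µ/σ² and τ = µ²/(σ² − µ), valid whenever σ² > µ. The following sketch is our illustration (not the paper's code) and uses the five-year CPTI sample moments quoted above:

```python
import math

def nbd_moments_fit(mu, var):
    """Method-of-moments NBD fit, assuming mean = tau*(1-nu)/nu
    and variance = tau*(1-nu)/nu**2 (requires var > mu)."""
    nu = mu / var
    tau = mu ** 2 / (var - mu)
    return nu, tau

def nbd_pmf(n, nu, tau):
    """P(n) = Gamma(tau+n)/(Gamma(tau) n!) * nu**tau * (1-nu)**n,
    evaluated in log space for numerical stability."""
    log_p = (math.lgamma(tau + n) - math.lgamma(tau) - math.lgamma(n + 1)
             + tau * math.log(nu) + n * math.log(1 - nu))
    return math.exp(log_p)

# Five-year CPTI sample moments quoted in the text:
nu, tau = nbd_moments_fit(8.55, 23.73)
```

Plugging the fitted (ν, τ) back into the mean and variance relations recovers the sample moments exactly, which is the defining property of a moment fit.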

Because the NBD is characterized by two parameters while the Poisson has only one, we used the Akaike Information Criterion (AIC) [Akaike, 1974] to compare the fits:

AIC = 2p − 2 ln L,

where L is the likelihood of the data given the fitted distribution and p is the number of free parameters.
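To make the comparison concrete, here is an illustrative AIC computation on a hypothetical, strongly overdispersed sample (our own numbers, not the CPTI data), with the Poisson fitted by its mean and the NBD fitted by moments:

```python
import math

def poisson_loglik(counts, lam):
    # Sum of log Poisson pmf terms: n*ln(lam) - lam - ln(n!)
    return sum(n * math.log(lam) - lam - math.lgamma(n + 1) for n in counts)

def nbd_loglik(counts, nu, tau):
    # Sum of log NBD pmf terms, using lgamma for stability.
    return sum(math.lgamma(tau + n) - math.lgamma(tau) - math.lgamma(n + 1)
               + tau * math.log(nu) + n * math.log(1 - nu) for n in counts)

def aic(log_l, p):
    """AIC = 2p - 2 ln L; the lower value indicates the preferred model."""
    return 2 * p - 2 * log_l

# Hypothetical overdispersed counts (illustration only):
counts = [3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 13, 14, 20, 25]
mu = sum(counts) / len(counts)
var = sum((c - mu) ** 2 for c in counts) / (len(counts) - 1)
nu, tau = mu / var, mu ** 2 / (var - mu)

aic_poi = aic(poisson_loglik(counts, mu), p=1)   # one free parameter
aic_nbd = aic(nbd_loglik(counts, nu, tau), p=2)  # two free parameters
```

For overdispersed data such as this sample, the NBD attains the lower AIC despite its extra parameter.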

For the five-year and ten-year intervals, the NBD fit the data better than the Poisson distribution. [...] the uncertainty implied by Poisson forecasts is too small, and they will fail more often than expected. As a result, the agreement of CSEP-Italy participants to use a Poisson distribution should be viewed as problematic for time-independent models, because no Poisson distribution that their model could produce will ever pass the tests with the expected 95% confidence rate. On the other hand, time-dependent models with varying rate might generate an NBD over a longer time period (the empirical NBD can even be used as a constraint on the model).

Solutions to this problem have been discussed previously. The traditional approach has been to filter the data via declustering (deletion) of so-called aftershocks (as done, for instance, in the RELM mainshock experiment [Field, 2007; Schorlemmer et al., 2007]). Here, we instead constructed NBD forecasts from the submitted Poisson forecasts, and repeat the tests with these so-called NBD forecasts.
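The conversion described in the next paragraph and in Appendix A can be sketched as follows. This is a simplified illustration under our own assumptions (a single total-rate forecast and an externally imposed variance), not the testing-center implementation:

```python
import math

def nbd_params(mean, var):
    """Invert the NBD mean/variance relations (assumes var > mean)."""
    nu = mean / var
    tau = mean ** 2 / (var - mean)
    return nu, tau

def nbd_cdf(n_obs, nu, tau):
    """NBD cumulative distribution, by summing pmf terms up to n_obs."""
    total = 0.0
    for n in range(n_obs + 1):
        log_p = (math.lgamma(tau + n) - math.lgamma(tau) - math.lgamma(n + 1)
                 + tau * math.log(nu) + n * math.log(1 - nu))
        total += math.exp(log_p)
    return total

# Hypothetical forecast: 8.55 expected events, with the imposed
# catalog-derived variance of 23.73 shared by all models.
nu, tau = nbd_params(8.55, 23.73)

# Quantile of a hypothetical observed count of 16 under the NBD forecast;
# an N-test rejects when this falls in the extreme tails.
delta = nbd_cdf(16, nu, tau)
```

Because the NBD spreads its mass more widely than a Poisson with the same mean, the same observed count sits less deeply in the tail, which is why the NBD confidence bounds reject fewer forecasts.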

We created NBD forecasts for the total number of observed events by using each forecast's mean and an imposed variance identical for all models, which we estimated either directly from the CPTI catalog or from extrapolation assuming that the observed numbers of events are uncorrelated. Appendix A describes the process in detail. Because the resulting NBD forecasts are tested on the same data that were used to estimate the variance, one should expect that the NBD forecasts perform well. The broader NBD results in less specificity, but also fewer unforeseen observations. We will re-examine this ad-hoc solution in the discussion in section 7.

6.1.1 N-Test Results

The N-test results in Figure 2a) show that only one forecast (Nanjo-et-al.RI) can be rejected assuming Poisson confidence bounds, because significantly more earthquakes were observed than expected. Using NBD uncertainties, none of the forecasts can be rejected, because the confidence bounds are wider (typically by several earthquakes on both sides).

6.1.2 L-Test Results

In Figure 2b), we show the results of the unconditional and the conditional L-tests applied to the original (Poisson) forecasts. We did not try to apply NBD uncertainty to the rate forecasts in each space-magnitude bin, and therefore did not simulate likelihood values based on an NBD forecast. [...] expects more earthquakes than were observed during this period (although not significantly more) and therefore also expects a likelihood score that is lower than observed. Moreover, the additional variability [...]

To summarize, the conditional L-test reveals information that is separate from the N-test results and presents a stricter evaluation of the forecasts. In the remainder of this article, we will only consider the conditional L-test.

One could equally construct a forecast from a "model of least information" [Evison, 1999], often called the null hypothesis, which might be based on a uniform spatial distribution, a total expected rate equal to the observed mean over a period prior to the target period, and a Gutenberg-Richter magnitude distribution with b-value equal to one. Because several models already assume that (i) magnitudes are identically and independently distributed according to the Gutenberg-Richter magnitude distribution and (ii) the total expected rate is equal to the mean number of observed shocks, the only real difference between these models and an uninformative forecast lies in the spatial distribution. We therefore included the likelihood score of a spatially uniform forecast only in the S-test results. In Table S1 of the electronic supplement, [...] Wiemer.ALM obtain scores that are lower than the score of a uniform model of least information.
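The comparison against a spatially uniform baseline can be sketched with a toy multinomial-style likelihood. All numbers below (the ten bins, the counts, the model probabilities) are hypothetical illustrations of the scoring idea, not CSEP-Italy values:

```python
import math

def spatial_log_score(probs, bin_counts):
    """Spatial log-likelihood score: sum of n_i * log(p_i) over occupied bins.
    probs must sum to 1 over the spatial bins."""
    return sum(n * math.log(p) for n, p in zip(bin_counts, probs) if n > 0)

# Hypothetical 10-bin region with 4 observed epicenters.
counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
model = [0.001, 0.001, 0.2, 0.05, 0.4, 0.05, 0.001, 0.25, 0.046, 0.001]
uniform = [1 / len(counts)] * len(counts)

# A model that concentrates probability where earthquakes occur
# beats the uniform baseline:
score_model = spatial_log_score(model, counts)
score_uniform = spatial_log_score(uniform, counts)

# But a single event in a near-zero "background" bin drags the model
# below the uniform score -- the failure mode discussed for the ALM group:
counts_bg = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
score_bg = spatial_log_score(model, counts_bg)
```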

In the case of the ALM group of forecasts, the low spatial likelihood scores leading to the S-test failures have a common origin. In roughly one half of all spatial bins, the three forecasts expect an extremely small [...] marginally less than a uniform forecast, although the score is consistent with the forecast's expectation.

The M-test results thus far, and for all but the longest of the target periods considered below, are [...]

[...] Poisson assumption is appropriate with respect to observations. In section 4.3, we showed that the target earthquake rate distribution is approximated better by an NBD than by a Poisson distribution. Therefore, [...]

[...] Wiemer.ALM) consistently fail the S-tests, and often perform worse than a uniform forecast, because isolated earthquakes occur in extremely low-probability "background" bins that cover roughly 50% of the region. We could not identify a common characteristic among the earthquakes that occurred in background bins.

Value of Retrospective Evaluation
The initial submission deadline for long-term earthquake forecasts for CSEP-Italy was 1 July 2009.

[Figure: Test results for the forecasts using two separate 5-year target periods of data from the CSI earthquake catalog. Red and green symbols indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate that the Poisson forecast was rejected while the NBD forecast was passed. Black crosses: (a) expected number of earthquakes, (b) expected conditional log-likelihood score, (c) expected spatial log-likelihood score, (d) expected magnitude log-likelihood score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast assuming a negative binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially uniform model.]

[Figure: Test results for the forecasts using separate 10-year target periods of data from the CPTI earthquake catalog. Red and green symbols indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate that the Poisson forecast was rejected while the NBD forecast was passed. Black crosses: (a) expected number of earthquakes, (b-f) expected conditional log-likelihood scores, (g-k) expected spatial log-likelihood scores, (l-p) expected magnitude log-likelihood scores, assuming the forecast is correct. Black bars: 95% confidence bounds of the model forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast assuming a negative binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially uniform model.]

For longer time periods (e.g., the durations of the CSI and CPTI catalogs), for which we cannot estimate the sample variance directly, we used the property that the variance of a finite sum of uncorrelated random variables is equal to the sum of their variances.
We treated the numbers of observed earthquakes as uncorrelated random variables, meaning that we assumed that the numbers of observed earthquakes in adjacent time intervals are independent of each other. This is likely to be a better approximation for the ten-year intervals. We computed the variance σ²(T) over some finite interval of T years from the reference variance σ²_10yr using

σ²(T) = (T/10) σ²_10yr,    (7)

[...] (4) and (5). Because the direct estimate of σ²_10yr is larger than twice σ²_5yr, it seems that there may be correlations at the five-year time scale. Alternatively, the sample size may be too small, because the 95% confidence intervals are large.
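Equation (7), together with the corresponding linear scaling of the mean, is a one-line computation. The derivation of NBD parameters for a hypothetical 100-year window below is our own illustration of how the scaled moments would be used:

```python
def scaled_variance(T, var_10yr=64.54):
    """Equation (7): for uncorrelated counts, variance grows linearly with T."""
    return (T / 10.0) * var_10yr

def scaled_mean(T, mu_10yr=17.10):
    """The mean of a sum of counts also grows linearly with T."""
    return (T / 10.0) * mu_10yr

# Illustration: moment-fitted NBD parameters for a hypothetical
# 100-year target window.
T = 100
mu_T, var_T = scaled_mean(T), scaled_variance(T)
nu, tau = mu_T / var_T, mu_T ** 2 / (var_T - mu_T)
```

Note that the directly estimated ten-year variance (64.54) exceeds twice the five-year variance (2 × 23.73 = 47.46), which is the discrepancy that motivates the caveat about correlations at the five-year scale.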

Table 4: Estimated variances of the numbers of observed earthquakes for different time intervals. *The variance was estimated directly from the catalog; the others were computed using equation (7).