The new Italian Seismic Hazard Model (MPS19)

The complete list of the participants of the MPS19 Working Group is reported in Appendix A in the Supplementary Material.

Abstract
We describe the main structure and outcomes of the new probabilistic seismic hazard model for Italy, MPS19 [Modello di Pericolosità Sismica, 2019]. Besides outlining the probabilistic framework adopted, the multitude of new data made available after the preparation of the previous MPS04, and the set of earthquake rate and ground motion models used, we give particular emphasis to the main novelties of the modeling and to the MPS19 outcomes. Specifically, we (i) introduce a novel approach to estimate and visualize the epistemic uncertainty over the whole country; (ii) assign weights to each model component (earthquake rate and ground motion models) according to a quantitative testing phase and structured experts' elicitation sessions; (iii) test (retrospectively) the MPS19 outcomes against the horizontal peak ground accelerations observed in the last decades and the macroseismic intensities of the last centuries; and (iv) introduce a pioneering approach to build MPS19_cluster, which accounts for the effect of the earthquakes that have been removed by declustering. Finally, to make the interpretation of the MPS19 outcomes easier for a wide range of possible stakeholders, we also represent the final result in terms of the probability of exceeding 0.15 g in 50 years.


Introduction
In 2015, the Seismic Hazard Center (Centro di Pericolosità Sismica, CPS) of the Istituto Nazionale di Geofisica e Vulcanologia (INGV) was commissioned by the Italian Civil Protection Department to engage and coordinate the national scientific community with the aim of elaborating a new seismic hazard model for Italy (Modello di Pericolosità Sismica, MPS19 hereafter). MPS19 takes into account the large amount of new data and methods that have been published after the last release of MPS [MPS04; http://zonesismiche.mi.ingv.it; Stucchi et al., 2011], and it is developed according to the most advanced international standards in seismic hazard [Gerstenberger et al., 2020].
MPS19 will be taken into consideration for updating the Italian building code [NTC, 2018], currently based on the MPS04 model. For this reason, since the beginning, CPS representatives have met with the national earthquake engineering community to discuss the most appropriate outcomes of the new seismic hazard model for building code purposes. Specifically: the model has to consider only declustered seismicity and to cover the whole national territory; the reference soil is rock (i.e., EC8 site category A, Vs,30 > 800 m/s); the hazard has to be expressed in terms of mean horizontal peak ground acceleration (PGA), peak ground velocity (PGV), peak ground displacement (PGD), spectral acceleration (SA), velocity, displacement, and macroseismic intensity (the first release described here contains only PGA and SA results); the oscillation periods of the spectral ordinates have to be in the range between 0.05 and 4 seconds, and the return periods from 30 to 5000 years.
The definition of MPS19 has engaged about one hundred researchers from institutes and universities through an open call to the Italian scientific community (see Appendix A in the Supplementary Material). Such a complex endeavor was completed in 2019, after a successful interaction with a participatory review panel, which gave its final positive approval to MPS19. At the time of writing, applications for building code purposes, or for any other kind of risk analysis, are still under discussion. In this paper, we summarize the scientific background of MPS19, the overall structure of the model, the novelties introduced, and the main outcomes. The specific description of the components is (or will be) covered in dedicated papers.
MPS19 is based on a Probabilistic Seismic Hazard Analysis [PSHA, Cornell, 1968], i.e., MPS19 describes in a probabilistic way the forecast of a variety of ground motion intensity measures (IMs; such as PGA and other quantities) on the Italian territory, and it is rooted in open and transparent procedures that guarantee completely reproducible outcomes; specifically, we use the open-source OpenQuake engine platform [Pagani et al., 2014] and all data will be available through a dedicated website. Before getting into the details, there are some features and novelties introduced in MPS19 that deserve to be highlighted, as they mark a difference with previous similar activities.
• MPS19 honors the hazard/risk separation principle, i.e., the hazard is evaluated without being conditioned by the effects on the risk. In other words, MPS19 is focused only on providing the forecast of the IM (i.e., the likelihood of observing different levels of IM) in a time window of 50 years, describing at best the historical seismicity and ground motion; purposely, we do not evaluate the impact on the risk, and/or introduce any additional subjective "precautionary" choices that, according to the hazard/risk separation principle, should be introduced only at the risk level.
• MPS19 relies on a clear and coherent probabilistic framework [Marzocchi and Jordan, 2018], which allows a univocal distinction between aleatory variability and epistemic uncertainty, a meaningful testing phase (to find possible ontological errors), and a clear interpretation of the mean and percentiles of the hazard outcomes. Then, we use an ensemble modeling strategy to get a complete description of the seismic hazard [Marzocchi et al., 2015].
• MPS19 includes an extensive testing phase to assign weights to each component of the model, to verify the reliability of the earthquake rate models (ERMs hereafter) and of ground motion models (GMMs hereafter), and of the hazard outcomes in terms of PGA and macroseismic intensity at several sites and, on average, over the whole territory.
• Large aftershocks and foreshocks may significantly affect the seismic hazard and risk, but they are usually removed by classical declustering techniques to obtain a set of time-independent (Poisson-distributed) earthquakes [Cornell, 1968]. For this reason, we also estimate a version of our model that introduces a pioneering attempt to correct the classical PSHA model to account for earthquake clustering [Iervolino et al., 2012; Marzocchi and Taroni, 2014].
The roadmap followed to build MPS19 consists of a few basic steps, which will be outlined in this paper: i) define a coherent probabilistic framework to fully describe the uncertainties (ensemble modeling), to check the consistency and relative predictive skill of each component of MPS19, and to test the consistency of the final MPS19 outcome (section 2); ii) improve the quality and accuracy of the input data (e.g., seismic catalog, seismotectonic zonation, etc.; section 3); iii) build ERMs based on these new input data, and test their consistency and relative predictive skill (section 4); iv) select the most appropriate GMMs for Italy, and estimate their relative predictive skill (section 5); v) weigh ERMs and GMMs through a combination of formal experts' judgment elicitations with the results of the testing phase (section 6); vi) show the outcomes of MPS19 and the results of the testing phase (section 7);

Carlo Meletti et al.
vii) introduce pilot procedures to build a national seismic hazard model which takes into account the effects of aftershocks and foreshocks that have been removed by the declustering technique (MPS19_cluster; section 8).
Finally, in section 9, we summarize the main novelties of the model, the open questions, and future developments.

The probabilistic framework and the testing phase
PSHA combines ERMs and GMMs to estimate the hazard curve at each site, i.e., the probability of exceeding different levels of the IM in a given time interval. To acknowledge the role of pervasive uncertainties and their importance, SSHAC [1997] states that PSHA should "represent the center, the body, and the range of technical interpretations that the larger technical community would have if they were to conduct the study." Handling uncertainties of different kinds is far from trivial. For example, in PSHA it is common to separate aleatory variability and epistemic uncertainty, even though a satisfactory definition has never been provided [SSHAC, 1997; NRC, 1997]. In MPS19 we adopt a probabilistic framework which makes this separation possible and external to the model. The framework is rooted in a univocal hierarchy of uncertainties which allows a clear interpretation of what a probability is (e.g., frequency or degree of belief), a meaningful testing of the PSHA model, and a clear interpretation of the mean hazard and of the percentiles, which have been a matter of intense discussion [Musson, 2005; McGuire et al., 2005; Abrahamson and Bommer, 2005]. The details of the framework can be found in Marzocchi and Jordan [2014]. Here we just summarize its most important feature in the PSHA context: the probability of exceedance (PoE), which characterizes the hazard at one site, is the frequency of an infinite sequence of yearly IM exceedances that are considered exchangeable (i.e., the sequence of exceedances can be reshuffled without losing information); the true frequency is the aleatory variability of the data-generating process and it is unknown (we could know it only if we had the "true" hazard model). We estimate this unknown aleatory variability using different models (or different parametrizations of the same model); the dispersion of these models describes the epistemic uncertainty and allows bounding where the true frequency (aleatory variability) should be.
For example, every single model gives an estimate of the PoE for one selected IM, and the PoEs of all models are described by a distribution; the center of this distribution is the best guess of the aleatory variability, and the dispersion mimics the epistemic uncertainty (an example of this distribution will be shown in section 6: "Weighing and merging all MPS19 components"). If the observed frequency of future data does not belong to this distribution, it means that we have found an ontological error, i.e., using Rumsfeld's words, "the presence of unknown unknowns"; this test is what Marzocchi and Jordan [2014] define as model validation (see Figure 4 in Marzocchi and Jordan [2017] for a visual representation of the problem).
This probabilistic framework allows us to use a meaningful testing phase to check the reliability of MPS19. In past common practice, the reliability of a seismic hazard model has been based on the consensus of the scientific community on the model's outcomes [Marzocchi and Zechar, 2011], even though a full consensus is impossible, and the definition of "scientific consensus" is intrinsically fuzzy [Aspinall, 2010]. The testing phase is a more objective approach that may contribute to checking the reliability of a model. Owing to the length of the time window considered (50 years), seismologists cannot usually validate models using independent data, even though some efforts in this direction are ongoing, with some limitations, usually using earthquakes of small to moderate magnitude [e.g., Mak et al., 2014; Brooks et al., 2017; Mousavi and Beroza, 2018]. Most of the time, as for MPS19, only past data, which have been used more or less directly to build the hazard model, can be considered for testing [e.g., Albarello and D'Amico, 2008; Stirling and Gerstenberger, 2010; Mak and Schorlemmer, 2016]. Hence, instead of using the term "validation", we prefer the less ambitious term "consistency" of the model's output with the data. In other words, the model should be consistent with the data that were available when the model was built.
At the same time, the adopted framework outlines a clear pathway to validate the model with future observations. We test the MPS19 outcomes, the ERMs, and the GMMs to check their consistency with the available data and their relative forecasting skill. For this purpose, we adopt some tests that have been developed within the Collaboratory for the Study of Earthquake Predictability (CSEP) framework [Jordan, 2006; Zechar et al., 2010a; Schorlemmer et al., 2018]. Specifically, we check the consistency of a model through the N- and S-tests [Schorlemmer et al., 2007; Zechar et al., 2010b], which have been modified to account for the epistemic uncertainty [Marzocchi and Jordan, 2018].
The N-test compares the total number of forecast earthquakes with the observed number over the entire region. The S-test is similar to a classical likelihood test, but it is applied to a normalized forecast, isolating the spatial distribution of the earthquakes, i.e., their locations.
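As a concrete illustration, the count-based part of this procedure can be sketched in a few lines; the function names and the event counts below are ours, and the two-sided check on the Poisson tails follows the logic of Zechar et al. [2010b].

```python
import math

def poisson_cdf(k: int, rate: float) -> float:
    """P(X <= k) for a Poisson variable, computed iteratively so that it
    stays numerically stable for catalog-scale expected counts."""
    term = math.exp(-rate)          # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= rate / i            # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)

def n_test(n_obs: int, n_forecast: float, alpha: float = 0.05) -> bool:
    """Two-sided N-test: the forecast passes if the observed count does not
    fall in either tail of the Poisson distribution implied by the model."""
    delta_1 = 1.0 - poisson_cdf(n_obs - 1, n_forecast) if n_obs > 0 else 1.0
    delta_2 = poisson_cdf(n_obs, n_forecast)
    return min(delta_1, delta_2) >= alpha / 2.0

# A model forecasting 300 events over the test period is consistent with
# 280 observed events, but not with 150.
```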
The forecasting skill of the models can be quantified through one or more scoring schemes. We argue that the wide proliferation of scoring schemes in the scientific literature is due to the fact that a perfect single scoring system does not exist; rather, each of them ranks models according to a specific metric, or loss function. Here we use two proper scoring methods [Scherbaum et al., 2009; Zechar and Zhuang, 2014] that explore different aspects of the models: the parimutuel gambling score, which tends to reward a model that produces forecasts often close to the observations, and the Log-likelihood score, which rewards models that never fail badly (a model that predicts all the events perfectly but assigns zero probability to a single observed event will have the worst possible score).
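The behavior of the Log-likelihood score can be made explicit with a toy example; the three-cell layout and the rates below are invented purely for illustration.

```python
import math

def log_likelihood_score(forecast_rates, observed_counts):
    """Joint Poisson log-likelihood of the observed counts, cell by cell.
    Higher is better; a single cell with zero forecast rate and at least
    one observed event drives the score to -infinity (the worst possible)."""
    llh = 0.0
    for rate, n in zip(forecast_rates, observed_counts):
        if rate <= 0.0:
            if n > 0:
                return float("-inf")   # the model "failed badly"
            continue                   # zero rate, zero events: log P = 0
        llh += -rate + n * math.log(rate) - math.lgamma(n + 1)
    return llh

# Two toy forecasts for the same observations [1, 0, 1]: a cautious model
# keeps a finite score; a model assigning zero rate to the third cell gets
# the worst score even though it fits the first two cells well.
score_cautious = log_likelihood_score([0.5, 0.5, 0.5], [1, 0, 1])
score_overconfident = log_likelihood_score([1.0, 1.0, 0.0], [1, 0, 1])
```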
To sum up, for the testing phase, we:
• test the consistency of every single ERM, comparing its rates with the observed seismicity rates of the last centuries (taken from the CPTI15 catalog); an ERM that does not pass both consistency tests is not included in MPS19;
• score ERMs and GMMs, comparing all models against each other using different metrics to rank them according to their retrospective forecasting (hereafter hindcasting) skill;
• evaluate the overall consistency of MPS19 with the past accelerometric data available in Italy, accounting for the epistemic uncertainty; in other words, we test whether the numbers of exceedances recorded in the last decades are in agreement with those expected by MPS19;
• test MPS19 using macroseismic intensity data; specifically, we test whether the number of macroseismic intensities of degree 6 MCS (Mercalli-Cancani-Sieberg scale [Sieberg, 1932]) and larger observed in the last centuries is consistent with the MPS19 outcomes.
A full description of the tests is beyond the scope of this paper and has been thoroughly discussed in Gneiting and Raftery [2007], Schorlemmer et al. [2007], Zechar et al. [2010b], and Marzocchi and Jordan [2018]. In this paper we just report the main results of the testing phase.

The input data
MPS19 takes advantage of a collection of new datasets (see Table 1), which have been used to build and test the ERMs and GMMs. They include, among others, an updated historical-instrumental earthquake catalog [CPTI15 v. 1.5, Rovida et al., 2016], an updated database of seismogenic sources [DISS 3.2.1, DISS Working Group, 2018], a new harmonized GPS velocity model for the Mediterranean area [Devoti et al., 2017], a new flatfile of strong motion recordings of earthquakes with moment magnitude Mw ≥ 4.0 for the selection and testing of the GMMs [Lanzano et al., 2019; see Section 5], and two assessments of the focal mechanisms of expected earthquakes [Roselli et al., 2018; Pondrelli et al., 2020]. All datasets used in MPS19 cover the whole Italian territory, with the exception of those referring to volcanic areas. Although their use was not prescriptive, these datasets have been used by several modelers to build their ERMs. One particularly important dataset is the CPTI15 catalog, which represents the common input to almost all ERMs, except those that use the strain rate field derived from geodesy as their main input. CPTI15 contains the parameters of more than 4500 earthquakes, from 1000 CE to 2014, determined with homogeneous criteria by merging updated macroseismic intensity data of past earthquakes with the instrumental determinations of modern ones, as described in Rovida et al. [2020].

[Table 1. Datasets used in MPS19 (N., dataset, description/notes, references).]
Owing to major improvements with respect to previous versions, we argue that the updated catalog is one of the most important factors explaining the differences between MPS19 and previous Italian PSHA models. Among the major improvements, macroseismic magnitudes (which are meant to estimate the moment magnitude) derived only from the epicentral intensity are replaced in CPTI15 with magnitudes obtained from new and updated intensity data distributions. The novelties of the catalog lead to a magnitude-frequency distribution that is more realistic than in previous versions of the catalog (i.e., the Gutenberg-Richter b-value is close to 1), and to the removal of spurious temporal patterns that characterized the time evolution of seismicity in previous versions of the catalog. Figure 1 shows how the earthquake magnitudes changed between CPTI04 ([Gruppo di Lavoro, 2004], adopted in MPS04) and CPTI15; in particular, a substantial magnitude reduction is evident in the magnitude range between 4.5 and 5.8.
Furthermore, we provide additional pieces of information which have been used by several ERM modelers: i) a model for the maximum magnitude; ii) a determination of the seismogenic layer and of the hypocentral depths; and iii) the completeness intervals of the CPTI15 catalog and of its declustered version. For the purpose of MPS19, we estimate the maximum magnitude in 18 macro-areas with homogeneous tectonic features. The maximum magnitude is evaluated by comparing the observed magnitudes in the catalog (spanning approximately 1000 years) with the magnitudes calculated from the geological information in DISS 3.2.1. Within each macro-area, we initially define the maximum observed magnitude as the magnitude of the strongest event in the CPTI15 catalog, considering a minimum level of uncertainty in the magnitude evaluation of 0.2 and 0.3 for instrumental (post-1980 AD) and historical (pre-1980 AD) earthquakes, respectively. This magnitude is then compared to the magnitudes of the composite seismogenic sources of DISS 3.2.1, and the larger of the two values is taken as representative of the maximum magnitude of each macro-area. Then, according to the procedure adopted in the European Seismic Hazard Model ESHM13 [Woessner et al., 2015], we define the lower threshold for the maximum magnitude as 6.5 for active shallow crustal regions (ASCR), 6.0 for stable continental regions (SCR), and 5.6 for the Etna volcanic area. Lacking better qualitative and quantitative data coverage, we define a second, higher value of the maximum magnitude by adding a uniform magnitude increment of 0.3 (except for the Etna area, where the increment is 0), to account for epistemic uncertainty.
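The recipe above can be sketched as follows. This is our illustrative reading of the procedure, not the MPS19 code itself; in particular, we assume the minimum uncertainty is added to the largest observed magnitude, and all names and the return convention are ours.

```python
def max_magnitude_branches(catalog_mags, diss_mags, region_type, instrumental):
    """Illustrative reconstruction of the MPS19 macro-area Mmax recipe:
    largest observed magnitude (plus an assumed minimum uncertainty of
    0.2/0.3 units for instrumental/historical events) compared with the
    largest DISS composite-source magnitude, floored by a tectonic
    lower threshold; a second, higher branch adds 0.3 (0 at Etna)."""
    floors = {"ASCR": 6.5, "SCR": 6.0, "Etna": 5.6}
    m_obs = max(catalog_mags) + (0.2 if instrumental else 0.3)
    m_low = max(m_obs, max(diss_mags), floors[region_type])
    m_high = m_low + (0.0 if region_type == "Etna" else 0.3)
    return m_low, m_high

# Hypothetical macro-area: strongest historical event M6.2, strongest
# DISS composite source M6.6 -> branches (6.6, 6.9).
```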
The same macro-areas defined for the maximum magnitude are used to assess the depth of the seismogenic layer. Following Boncio et al. [2009], in each macro-area we first calculate the depth distribution of the earthquakes in the 1981-2015 instrumental catalog (dataset 3 in Table 1; Lolli et al. [2020]). Then, for the events within the 5th and 95th percentiles, we evaluate the parameters of both a unimodal and a bimodal distribution, selecting the better one according to the Akaike Information Criterion, AIC [Akaike, 1974]. Finally, we check the stability of the results using different magnitude thresholds (Mw ≥ 2.0, 2.5, 3.0, 4.0).
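The model-selection step can be illustrated with a small self-contained example: a single Gaussian and a two-component Gaussian mixture (fitted with a plain EM loop) compared via AIC. Everything here (distribution families, initialization, iteration count, trimming details) is an illustrative choice of ours, not the MPS19 implementation.

```python
import math

def _pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_gaussian(xs):
    """Single-Gaussian fit; returns (log-likelihood, n. of parameters)."""
    mu = sum(xs) / len(xs)
    sd = max(1e-3, math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs)))
    return sum(math.log(_pdf(x, mu, sd)) for x in xs), 2

def fit_mixture(xs, iters=300):
    """Two-Gaussian mixture fitted with a plain EM loop; returns (ll, 5)."""
    n = len(xs)
    s = sorted(xs)
    mu = [s[n // 4], s[3 * n // 4]]                 # crude initialization
    sd = [max(1e-3, (s[-1] - s[0]) / 6.0)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each depth
        r0 = []
        for x in xs:
            p0 = w[0] * _pdf(x, mu[0], sd[0])
            p1 = w[1] * _pdf(x, mu[1], sd[1])
            r0.append(p0 / (p0 + p1))
        # M-step: re-estimate weights, means, standard deviations
        for j, rj in enumerate((r0, [1.0 - v for v in r0])):
            nj = sum(rj)
            w[j] = nj / n
            mu[j] = sum(v * x for v, x in zip(rj, xs)) / nj
            sd[j] = max(1e-3, math.sqrt(sum(v * (x - mu[j]) ** 2
                                            for v, x in zip(rj, xs)) / nj))
    ll = sum(math.log(w[0] * _pdf(x, mu[0], sd[0]) + w[1] * _pdf(x, mu[1], sd[1]))
             for x in xs)
    return ll, 5

def aic(ll, k):
    """Akaike Information Criterion: lower is better."""
    return 2.0 * k - 2.0 * ll

def best_depth_model(depths):
    """Trim to the 5th-95th percentile band, then keep the unimodal or
    bimodal fit with the lower AIC, as in the seismogenic-layer step."""
    s = sorted(depths)
    lo, hi = s[int(0.05 * len(s))], s[int(0.95 * len(s)) - 1]
    xs = [d for d in depths if lo <= d <= hi]
    return "unimodal" if aic(*fit_gaussian(xs)) <= aic(*fit_mixture(xs)) else "bimodal"
```

With clearly separated shallow and deep seismicity (say, clusters near 6 and 26 km), the bimodal fit wins by a wide AIC margin.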
For seismic hazard analysis purposes, we distribute a version of the catalog in which every earthquake is marked as a mainshock or an aftershock, together with the starting years of the complete periods for its magnitude according to the two approaches described below, and the completeness region it belongs to. Since in PSHA an earthquake catalog is declustered to obtain mainshocks distributed in time according to a Poisson process [Lomnitz, 1966], we use the GK74 algorithm [Gardner and Knopoff, 1974], which is the most effective for achieving this goal [van Stiphout et al., 2011; Luen and Stark, 2012]. The use of the general GK74 technique is justified by a recent empirical analysis of worldwide seismic sequences [Stallone and Marzocchi, 2019], which shows that earthquake clustering is similar in different crustal tectonic regions.
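A minimal sketch of GK74-style window declustering follows; we use the fitted space-time windows as reported by van Stiphout et al. [2011], and the symmetric time window (removing foreshocks as well as aftershocks) and all function names are simplifications of ours.

```python
import math

def gk74_window(mag):
    """Space (km) and time (days) windows of Gardner and Knopoff [1974],
    in the fitted parameterization reported by van Stiphout et al. [2011]."""
    dist_km = 10 ** (0.1238 * mag + 0.983)
    if mag >= 6.5:
        time_days = 10 ** (0.032 * mag + 2.7389)
    else:
        time_days = 10 ** (0.5409 * mag - 0.547)
    return dist_km, time_days

def decluster(events):
    """events: (time_days, x_km, y_km, mag) tuples; returns mainshock flags.
    Larger events claim their windows first; any smaller event falling
    inside a window is flagged as a fore/aftershock."""
    main = [True] * len(events)
    for i in sorted(range(len(events)), key=lambda k: events[k][3], reverse=True):
        if not main[i]:
            continue
        t0, x0, y0, m0 = events[i]
        d_win, t_win = gk74_window(m0)
        for j, (t, x, y, m) in enumerate(events):
            if j == i or not main[j] or m > m0:
                continue
            if abs(t - t0) <= t_win and math.hypot(x - x0, y - y0) <= d_win:
                main[j] = False
    return main

# Toy catalog: an M6 mainshock, an M4 aftershock 10 days and 5 km away
# (removed), and an unrelated M5 event far in space and time (kept).
catalog = [(0.0, 0.0, 0.0, 6.0), (10.0, 5.0, 0.0, 4.0), (2000.0, 300.0, 0.0, 5.0)]
```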
Independently of the declustering technique used, we underline that its effect on the seismic hazard can be significantly mitigated by adding a suitable correction for declustering. We discuss this point in section 8.
As regards the completeness of the catalog, we use the same approaches adopted in MPS04 [MPS Working Group, 2004; Stucchi et al., 2011], which are based on the historical and statistical procedures described in Stucchi et al. [2004] and Albarello et al. [2001], respectively. To match the new features of CPTI15 with the previous historical completeness estimates, we i) redefine the Mw bins according to the new empirical conversion relation between epicentral intensity and magnitude used in CPTI15; ii) evaluate the completeness intervals for epicentral intensity < 6, because such intensities were not included in the catalog used for MPS04; and iii) revise the geometry of the regions with uniform completeness, adding one more region that covers the off-shore areas. As in MPS04, we assess completeness in Mw bins of 0.23 Mw units, to avoid the oversampling of some Mw intervals that contain values derived from the conversion of more than one discrete epicentral intensity value [see Stucchi et al., 2011].
As regards the statistical completeness, we apply the statistical analysis of Albarello et al. [2001] in the same regions.

The ERMs
The researchers who answered the open call spontaneously organized themselves into working groups and provided 11 independent ERMs for the whole Italian territory, plus one model for the Etna volcano. An ERM forecasts the earthquake rate for different magnitudes, locations, geometries, and kinematics. The models were built independently on purpose, to minimize the possibility that single models would be adjusted towards a mean consensus value, which unavoidably leads to underestimating the real uncertainty. The 11 ERMs are grouped according to the main typology of seismogenic sources: 5 area-source ERMs (MA1 to MA5), based on the definition of areal sources for which seismicity rates are defined; 2 smoothed, zonation-free ERMs (MS1, MS2), which model the spatial density distribution of the earthquakes in seismic catalogs; 2 fault-sources-plus-background ERMs (MF1, MF2), based on the identification of seismogenic sources using tectonic and geophysical data; and 2 geodetically derived, zonation-free ERMs (MG1, MG2), which model seismicity rates derived from the geodetic strain rate field. The last two models, which are mostly based on ground deformation, provide a different and independent level of information with respect to the other ERMs, because they are almost completely independent of CPTI15.
Each ERM is deemed appropriate for inclusion in MPS19 if it i) passes the consistency tests described in Section 2, and ii) is considered physically plausible by a group of independent experts through a formal experts' elicitation, which is described in Section 6. At first, we carry out statistical tests on a national scale, to decide whether an ERM is appropriate for inclusion in MPS19, and to measure the relative hindcasting skill used to weigh each single ERM (see Section 6). Following a suggestion of the participatory review panel (see Appendices A and B in the Supplementary Material), we also carry out the consistency tests of the ERMs on regions identified by a group of experts according to the following criteria: separate regions with high and low numbers of geological data, earthquakes, and geodetic stations; minimize the overlap of areas with different completeness; keep the areas as large as possible, avoiding marked tectonic inhomogeneities; minimize the off-shore part included in the regions. Following these criteria, we define 7 macro-areas (Figure 2).
Testing the ERMs in each region allows us to check whether their hindcasting skill varies with space and/or with the number of input data. If an ERM does not pass both consistency tests in one region, its overall earthquake rate in that sub-region is replaced by the weighted average of the other consistent ERMs, preserving the geometries, depths, and focal mechanisms of the model. All models of Table 2 passed both consistency tests at the national scale, hence they are deemed suitable for inclusion in MPS19. However, as shown in Table 3, some of them failed to describe satisfactorily the data in some macro-areas of Figure 2.

[Table 3. For each model, the macro-areas (Figure 2) in which it does not pass the test; "None" means that the model passes the test in all macro-areas.]
[Table 2. ERMs considered in MPS19.]

Figure 4 shows the epistemic uncertainty of the rates of the ERMs considered in MPS19. The mean rates of the 11 ERMs lie in a range from ~4 eq/yr to ~8 eq/yr. The lowest and highest rates are ~3 eq/yr and ~17 eq/yr, from a fault-based and an area-source ERM, respectively. Considering the overall weighted distribution of the rates, the 2.5th-97.5th and 16th-84th percentiles are ~4.4 eq/yr - ~8.6 eq/yr and ~4.8 eq/yr - ~6.9 eq/yr, respectively.
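Percentiles of a weighted ensemble of rates can be computed with a simple cumulative-weight scan; the rates and weights below are hypothetical numbers, chosen only so that the example reproduces a 16th-84th band like the one reported above.

```python
def weighted_percentile(values, weights, q):
    """q-th percentile (0-100) of a weighted sample: scan the sorted
    values and return the first one whose cumulative weight fraction
    reaches q/100."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc / total >= q / 100.0:
            return v
    return pairs[-1][0]

# Hypothetical ensemble of ERM mean rates (eq/yr) with illustrative weights
rates = [4.2, 4.8, 5.3, 5.9, 6.4, 6.9, 7.6, 8.3]
weights = [0.05, 0.15, 0.2, 0.2, 0.15, 0.1, 0.1, 0.05]
band_16_84 = (weighted_percentile(rates, weights, 16),
              weighted_percentile(rates, weights, 84))
```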

The GMMs
This section summarizes the selection and weighing of the GMMs for MPS19, which is exhaustively treated in Lanzano et al. [2020]. The candidate GMMs potentially suitable for Italy have to satisfy the following general conditions: i) representation of the characteristics of ground motion for the whole Italian territory; ii) calibration for engineering bedrock (i.e., EC8 site category A, Vs,30 > 800 m/s) and flat topography conditions; iii) prediction of the horizontal PGA and of 12 ordinates of the acceleration response spectrum (5% damping), at periods T = 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 2, 3, and 4 seconds; iv) horizontal motion expressed as the geometrical mean of the horizontal components.
We identify three main tectonic regions, which are characterized by peculiar ground motion attenuation properties: i) Active Shallow Crustal Regions (ASCRs); ii) Volcanic Zones (VZs); and iii) the Subduction Zone (SZ) in the southern Tyrrhenian sea. A preliminary selection of candidate GMMs is performed over nearly one thousand models published in the scientific literature, which are listed in Douglas [2019]. In order to reduce the number of available models, we adopt the rejection criteria of Cotton et al. [2006] and Bommer et al. [2010], and then add further project-specific criteria to fulfil the requirements of MPS19 listed above [see Lanzano et al., 2020 for details]. The candidate models for ASCRs after this pre-selection are shown in Table 4, and Figure 5 displays the trellis plots of their spectral acceleration predictions, assuming rock-site conditions (EC8 site category A or Vs,30 > 800 m/s) and normal faulting.

[Figure 5. Trellis plot of the median values of the GMMs pre-selected for ASCRs (see Table 4 for the abbreviations).]

The GMMs are tested against the records of the ITACA database [Luzi et al., 2016]. The testing datasets for ASCRs and SZ are provided and described in Lanzano et al. [2020]. We verify the stability of the hindcasting skill of the candidate GMMs, and examine their behavior on the basis of the requirements of MPS19, using 4 different subsets: i) the whole dataset (named All); ii) a subset of records from stations classified as EC8 site class A (named EC8-A); iii) a subset of events with Mw ≥ 5 recorded at stations located at RJB < 50 km (named MR), since these magnitude and distance ranges are those mostly affecting the hazard in Italy; iv) a validation subset without the Italian records preceding 2009 (named After 2009), to obtain a dataset independent of that used to calibrate the ITA10 model [Bindi et al., 2011].
The last subset is the only one that allows a truly independent comparison of the GMMs; this is essential to quantify possible overfitting effects, which could favor those GMMs built using, to various extents, the data from ITACA.
For the purpose of measuring the relative GMM hindcasting skill, we consider the proper scoring methods described in Section 2, i.e., the Log-likelihood (LLH) proposed by Scherbaum et al. [2009] and the parimutuel gambling score [Zechar and Zhuang, 2014]; we also consider an additional method, the quantile score [Gneiting and Raftery, 2007], to check the consistency of the results. The details of the scoring procedure and the scoring values for each GMM are reported and discussed in Lanzano et al. [2020]; here, in Table 5, we show the average values of the scores and the un-normalized weight for each GMM using the EC8-A subset; the un-normalized weight is the ordinal number of the score. The grand total is obtained by summing up the total un-normalized scores of the EC8-A and MR subsets. Worthy of note, the best-scoring GMMs, besides showing the best relative hindcasting skill, use different metrics for the distance (i.e., RJB, Rhyp, and Rrup, respectively) and are calibrated on different datasets (i.e., Italian, European, and global, respectively).

[Footnote 3: The AB10 predictions at 4 s (not provided by the Authors) are linearly extrapolated from those at 2 and 3 s.]
[Footnote 4: CZ15 is provided in terms of ordinates of the displacement response spectra (SD), and the predictions have been converted to SA by multiplying by (2π/T)².]
[Footnote 5: Most of the GMMs are also calibrated for PGV, with the exception of the IDR14 and ZHA06 models.]
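The displacement-to-acceleration conversion used for CZ15 (SA = SD × (2π/T)²) is a one-liner worth making explicit; the function name and units are illustrative choices of ours.

```python
import math

def sd_to_sa(sd, period):
    """Pseudo-spectral acceleration from spectral displacement:
    SA = SD * (2*pi/T)**2. Units follow the input: SD in cm and T in
    seconds give SA in cm/s^2."""
    return sd * (2.0 * math.pi / period) ** 2
```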

Weighing and merging all MPS19 components
It is essential to weigh each ERM and GMM to properly merge them in the final MPS19. The weighing procedure is often based on unstructured experts' judgment [Gerstenberger et al., 2020]. Here we aim at providing a more objective procedure that could make the weighing scheme more sound, transparent and reproducible.
The final weight of each component is assigned accounting for its statistical scoring (described in the sections 4 and 5), and the outcomes of an experts' elicitation session with independent national and international experts.
As a matter of fact, the weight of each model (either ERM or GMM) is a combination of two different evaluations:

W = (W1 × W2) / C    (1)

where W1 and W2 are, respectively, the scores from the testing phase and from the experts' elicitation, and C is a normalization factor that sets W in the range (0,1) and ensures that the sum of all weights is 1; specifically, C is the sum of W1 × W2 over all models. The need to merge these two different scores stems from the fact that the number of data for the testing phase is limited and the test is made retrospectively (so models that tend to overfit the data are rewarded more than others in the testing phase), and that the earthquake catalog may not cover the whole seismic cycle; in these cases, the experts' opinion is expected to provide independent additional information for weighing the models [Delavaud et al., 2012]. Note that equation (1) gives zero weight to models that are either unable to explain the data or unrealistic, e.g., based on assumptions that are blatantly wrong.
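Equation (1) reduces to a few lines of code; the scores in the example are invented for illustration.

```python
def combine_weights(test_scores, expert_scores):
    """Equation (1): W_i = (W1_i * W2_i) / C, with C = sum_j W1_j * W2_j,
    so the weights lie in [0, 1] and sum to 1. A model judged either
    inconsistent with the data (W1 = 0) or unrealistic (W2 = 0)
    receives zero weight."""
    products = [w1 * w2 for w1, w2 in zip(test_scores, expert_scores)]
    c = sum(products)
    return [p / c for p in products]

# Three toy models: the third is rejected by the experts (W2 = 0).
weights = combine_weights([2.0, 1.0, 1.0], [1.0, 1.0, 0.0])
```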
In Tables 6 and 7 we show the weights of the ERMs and GMMs, which take into account the statistical scoring and the experts' judgment. In Appendix E of the Supplementary Material we report the description of the experts' elicitation sessions. To summarize, the weighing procedure adopted is as objective and auditable as possible; it considers independent pieces of information, namely the hindcasting performance of the models and the experts' judgment on their scientific reliability. Nonetheless, the procedure also has some limitations, mostly due to the finite number of data and to the impossibility of carrying out more specific tests at the regional scale, or tests that depend on the earthquake magnitude. Here we report a list of these limitations.
i. The statistical procedure used to select and score the ERMs and, partially, the GMMs does not take into account that some models were calibrated with the same data used for testing, while others were not. This means that the latter are probably underscored (and underweighted).
ii. The weight of each ERM and GMM is constant for the whole country, whereas some ERMs are not used in some sub-regions where they failed the tests.
iii. Owing to the limited number of data, the testing procedure covers only some specific aspects of the earthquake occurrence and ground motion processes. For example, the available data do not allow us to test ERMs for M7+ earthquakes, or to compare GMM performance in the very near field (less than 10 km), which is very relevant for seismic hazard.
iv. The procedure carries an unavoidable subjectivity in merging the different scoring metrics of the testing phase with the scores of the experts' judgment. Other procedures to merge these scores are possible.
v. The weights based on the testing phase do not take into account that some ERMs may perform poorly because of uncertainties in the oldest earthquakes of the seismic catalog, such as the large location uncertainty of off-shore and/or very old earthquakes, and the magnitude of historical events, which may be affected by cumulative damage caused by seismic sequences. This is unavoidable, but it highlights again the need not to weigh a model using only its hindcasting performance.
Once the ERM and GMM weights have been defined, the seismic hazard at each site is estimated by a set of N hazard curves, where N is the number of ERM and GMM combinations. Worthy of mention, this set of hazard curves is expected to sample the epistemic uncertainty, and a higher number of models cannot reduce it. In fact, even though the models are developed independently from one another, almost all of them use the same amount of information.
If we dissect the bunch of curves at one specific IM value, we get a set of estimates of the PoE for that IM value (Figure 6). We model them using a Beta distribution [Marzocchi et al., 2015], which describes where the true PoE is expected to be. The choice of the Beta distribution is subjective and other distributions may be used [see Gelman et al., 2003], keeping the same probabilistic framework adopted here. That said, we argue that this choice is certainly less subjective than describing the hazard using the mean alone, which would imply the use of a Dirac distribution, or using the percentiles of the distribution, which implicitly means imposing a nonparametric distribution. The Beta distribution usually fits well random variables with a unimodal distribution bounded between 0 and 1; in case the data show more modes [for example, 2 modes as in Figure 6 of Marzocchi et al., 2015], we can still use the Beta distribution if we think the bimodality is only due to the fact that the models used are just exploring two extreme scenarios, and a proper epistemic uncertainty has to fill the gap between the modes. To sum up, no matter what distribution we decide to use, this choice is certainly less critical than using one single value for the hazard.
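A Beta distribution can be fitted to a set of weighted PoE estimates by matching its first two moments. The sketch below illustrates this method-of-moments step; the function name and the numerical PoE values and weights are hypothetical placeholders, not MPS19 outputs:

```python
import numpy as np

def beta_from_poe_estimates(poe, weights):
    """Method-of-moments Beta fit to a set of PoE estimates.

    poe: exceedance-probability estimates from the ERM x GMM combinations
    weights: the corresponding model weights (normalized to sum to 1)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    p = np.asarray(poe, dtype=float)
    mu = np.sum(w * p)                # weighted mean PoE
    var = np.sum(w * (p - mu) ** 2)   # weighted variance
    # standard moment-matching relations for Beta(alpha, beta)
    common = mu * (1.0 - mu) / var - 1.0
    alpha = mu * common
    beta = (1.0 - mu) * common
    return alpha, beta

# hypothetical PoE estimates for one site and one IM level
alpha, beta = beta_from_poe_estimates([0.08, 0.10, 0.12, 0.11],
                                      [0.3, 0.3, 0.2, 0.2])
```

By construction, the mean of the fitted Beta, alpha/(alpha + beta), equals the weighted mean of the input PoE estimates, while the spread of the distribution carries the epistemic uncertainty.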

The MPS19 outcomes and consistency test
Regional seismic hazard is usually presented as a map showing the PGA values that have 10% PoE in 50 years.
This representation is of particular interest to the engineering community, which establishes this hazard level as the one of interest for the building code. However, this representation is often misinterpreted by laymen and some end-users, who assume that the map reports the maximum PGA that can occur in each area. Consequently, although it is quite obvious to observe PGA exceedances close to the epicenter of a large earthquake, such an observation is often deemed a proof that the probabilistic model is wrong. For this reason, we propose here an additional graphical representation of MPS19 that may be more easily interpreted by the non-engineering community. Specifically, a PGA of 0.15 g has been adopted as a criterion for significant shaking, although other values can be used [WGCEP, 1995]. Figures 7b and 7c show the more classical representation of MPS19, in terms of PGA with 10% and 2% PoE in 50 years.
One of the most innovative aspects of MPS19 concerns the quantification of the epistemic uncertainties of the final hazard estimates. As mentioned above, for the probabilistic scheme used, the exceedance probability at one site for one specific IM is not a single number, but it is a Beta distribution, the dispersion of which describes the epistemic uncertainty on the true exceedance frequency (random variability). One way to represent epistemic uncertainties on the whole national territory is given by the mapping of the CoV, i.e. coefficient of variation.
CoV is given by

CoV = σ / μ (2)

where σ is the weighted standard deviation of all the hazard estimates, and μ is the weighted average among all estimates (these parameters are used to calculate the parameters of the Beta distribution). A CoV of 0.3 indicates that the standard deviation among all the models used (including the uncertainties within each model) is 30% of the average. Figure 8 shows the CoV map for the whole national territory for the PGA values that have an exceedance probability of 10% in 50 years (Figure 7b). From the figure it can be seen that the highest values of the CoV occur in areas with low seismic hazard. When we consider only areas with high hazard (i.e., those with a higher PGA, panel (c)), the CoV mainly varies between 0.15 and 0.30. The CoV is higher in low-hazard areas because the information available there is relatively scarce, so greater uncertainties are to be expected. It is worth remarking that this result poses a great challenge to the use of MPS19 (as well as of any other model built on the same amount of information) for the definition of the building code. As a matter of fact, the current Italian building code is based on one hazard input with 4 decimal digits and does not consider epistemic uncertainty; de facto, the latter is handled by engineers introducing some arbitrary conservative parameters. Figure 8 shows that this demand for precision does not match the current epistemic uncertainty in PSHA. In practice, this means that even in the most optimistic (and likely unrealistic) case where MPS19 is "right", the mean values of future PSHA models built on more data are expected to vary within the range given by the CoV. In essence, the CoV represents a lower bound on the expected future variations of the mean hazard.
As regards the test of MPS19 against observations, in Figure 9 we show the results of the consistency test of MPS19 against the observed PGA exceedances. We consider accelerometric stations on rock (EC8 A site category) and on all types of terrain (EC8 A, B, C and D) with flat topographical surface (i.e., T1 category according to the NTC18 Italian building code; NTC [2018]). In case the station is not located on rock, the PGA observation is scaled down by the weighted average of the site coefficients of the 3 GMMs selected for ASCRs, according to their weights. For rock sites we consider all stations, either with a good site characterization (type A) or with a more uncertain site characterization (named type A*). When considering all types of terrain, we use only the stations with a good site characterization. The IM observed at different sites is correlated [Park et al., 2007], and this may be an issue for testing [Iervolino et al., 2017]. We adopt a pragmatic approach, choosing sites that have a minimum distance of 30 km from one another. In this case, although part of the IM correlation remains regardless of the distance [Park et al., 2007], we may assume that, for small PoE, exceedances are independent, i.e., the observation of an exceedance E_Y at one site Y does not modify the probability of observing an exceedance E_X at the nearby site X: P(E_X | E_Y) = P(E_X) ≡ PoE. Hence, the number of exceedances follows a binomial distribution with parameter equal to the selected PoE. Worthy of note, this assumption makes the test of PGA conservative: if the null hypothesis (the exceedances are generated by a process which is well described by MPS19) is not rejected, it will also not be rejected when relaxing the assumption, i.e., using a more exact distribution for the number of exceedances, which is overdispersed with respect to the binomial distribution [see Figure 2 in Iervolino et al., 2019].
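Under the independence assumption above, the number of exceedances among n well-separated sites follows a binomial distribution with parameter equal to the selected PoE. Below is a minimal sketch of a generic exact two-sided binomial consistency test; the function name, site counts, and significance level are hypothetical and this is not the exact procedure used for MPS19:

```python
from math import comb

def binomial_consistency_test(n_sites, n_exceed, poe, alpha=0.05):
    """Exact two-sided binomial test: are the observed exceedances
    consistent with n_sites independent sites sharing the same PoE?"""
    pmf = [comb(n_sites, k) * poe**k * (1 - poe)**(n_sites - k)
           for k in range(n_sites + 1)]
    # p-value: total probability of outcomes no more likely than the observed one
    p_obs = pmf[n_exceed]
    p_value = sum(p for p in pmf if p <= p_obs + 1e-12)
    return p_value, p_value >= alpha  # True -> model not rejected

# hypothetical example: 50 well-separated rock sites, PoE = 10%,
# 6 observed exceedances (expectation is 5)
p_value, consistent = binomial_consistency_test(50, 6, 0.10)
```

The conservativeness noted in the text means that a non-rejection under this binomial model would survive a switch to an overdispersed counting distribution.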
In the analysis reported in Figure 9, we show the results of the test when the epistemic uncertainty of the PoE at each site is not included, which means not considering possible common-mode errors [Marzocchi and Jordan, 2018]. This also makes the test very conservative for our purposes, i.e., a non-rejection would also be a non-rejection if uncertainties were included.
MPS19 is also tested against the macroseismic data observed in Italy as reported in the database DBMI15 v1.5. In this way, we can use a much longer period of observations, i.e., several centuries. Since MPS19 does not currently produce estimates in macroseismic intensity, the first step is to convert the PGA exceedance rates λ(x), where x is the PGA, into rates of exceedance of macroseismic intensities λ(I). This is done using the procedure suggested by D'Amico and Albarello [2008]:

λ(I) = Σ_{k=1}^{N} P(I | x_k) [λ(x_k) − λ(x_{k+1})] (3)

where P(I | x) is the empirical function for converting PGA into macroseismic intensity, and N is the number of PGA levels considered in the hazard curves of the model. Obviously, this passage introduces a further source of possible ontological error, as the empirical function for transforming PGA into macroseismic intensity is subject to high uncertainty. For this test, we account for these uncertainties using two functions, estimated by Faenza and Michelini [2010] and Gomez Capera et al. [2020].
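The rate conversion of equation (3) can be sketched as follows, assuming a generic empirical conversion function; the function name and the PGA levels, rates, and conversion probabilities are all hypothetical placeholders, not the actual D'Amico and Albarello [2008] coefficients:

```python
import numpy as np

def intensity_exceedance_rate(pga_rates, p_int_given_pga):
    """Convert PGA exceedance rates lambda(x_k) into an exceedance rate
    for a target macroseismic intensity, in the spirit of eq. (3).

    pga_rates: exceedance rates lambda(x_k) at N increasing PGA levels
    p_int_given_pga: P(target intensity | PGA in the k-th bin),
                     a hypothetical empirical conversion function
    """
    rates = np.asarray(pga_rates, dtype=float)
    # occurrence rate of each PGA bin: difference of consecutive
    # exceedance rates (the last bin keeps its full exceedance rate)
    d_rate = np.append(rates[:-1] - rates[1:], rates[-1])
    return float(np.sum(np.asarray(p_int_given_pga) * d_rate))

# hypothetical hazard curve (exceedance rates, 1/yr, at 4 PGA levels)
lam_I = intensity_exceedance_rate(
    [0.10, 0.04, 0.01, 0.002],    # lambda(x_k)
    [0.05, 0.3, 0.7, 0.95],       # P(I >= 6 | PGA bin), hypothetical
)
```

The sum weights the occurrence rate of each PGA bin by the probability that shaking in that bin produces the target intensity, yielding the intensity exceedance rate used in the macroseismic test.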
We select 27 sites with a long seismic history for the analysis. The number of observations at each site takes into account the completeness of the data and considers only macroseismic intensities equal to 6 or higher generated by mainshocks (macroseismic intensities associated to foreshocks and aftershocks are removed). To avoid the spatial correlation of macroseismic intensities, the distance between any pair of sites is always larger than 30 km (see the discussion above on testing the PGA exceedances), and we keep only one intensity (the maximum observed) for each earthquake; since these observations are generated by mainshocks, which follow the Poisson distribution, we assume that they are independent and identically distributed. This assumption facilitates the consistency test phase, since we can test against aggregate observations. The procedure is similar to that described for the PGA exceedances. In this case, the epistemic uncertainty is taken into account; that is, the rate parameter λ is described by a Gamma pdf, estimated from the different (weighted) rates of all the models. Since there is considerable uncertainty and variability on the type of soil where the macroseismic intensities have been observed, the test is carried out for two different types of soil: A and C.
Since all the GMMs used in MPS19 assume a linear behavior of soils, the PGA for type C soils is estimated by translating the hazard curve calculated for type A soils by a factor derived for Italy from the GMM by Bindi et al. [2011].
Note that the test for type A soils underestimates the number of exceedances, because it does not take into account any site amplification.

A pilot modeling: MPS19_cluster
One of the basic assumptions in PSHA is that the earthquake occurrence has to be stationary in time and Poisson distributed [Cornell, 1968]. To satisfy this assumption seismic catalogs are commonly declustered, i.e., clusters of seismicity, which represent a clear deviation from the Poisson distribution, have to be removed. The basic rationale behind this choice is that by keeping the largest events we remove earthquakes that have a negligible contribution in terms of IMs and seismic hazard [Lomnitz, 1966]. Even assuming that a sharp distinction between mainshocks and all other earthquakes exists (yet we are not aware of any convincing physical reason that allows us to postulate the existence of such an unknown sharp distinction), the epistemic uncertainty cannot be estimated by the variability generated by the current declustering techniques.
In MPS19 we use the GK74 declustering technique, because it is the most efficient procedure for obtaining a mainshock distribution similar to the Poisson distribution (see Section 3). Regardless of the declustering method used, this operation implies that strong ground shaking caused by aftershocks and foreshocks is not considered in classical PSHA [Boyd, 2012; Iervolino et al., 2012; Marzocchi and Taroni, 2014] and risk [Papadopoulos et al., 2020]. For this reason, we build a version of MPS19 that includes the effects of earthquake clustering (hereafter MPS19_cluster) and allows us to quantify its impact in terms of seismic hazard. MPS19_cluster is rooted in two different strategies, described by Iervolino et al. [2012] and Marzocchi and Taroni [2014]. For the sake of example, in Figure 11 we show the MPS19_cluster obtained by applying only the procedure described by Marzocchi and Taroni [2014] to the mean ERM used in MPS19. The correction is magnitude-dependent, in order to restore the magnitude-frequency distribution of the earthquakes before declustering, and it is applied to each spatial bin.
This figure is intended to give a flavor of the possible impact on the PGA estimated for 10% and 2% PoE in 50 years of MPS19 (Figures 7b and 7c, respectively); from the figure we notice that the mean hazard could increase by about 20-25% in the regions of higher hazard for 10% PoE in 50 years, with some increases of more than 30%. For the 2% PoE in 50 years, the increase is smaller, about 10-15%, with some increases of about 25%.
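The magnitude-dependent correction applied per spatial bin can be sketched as follows, in the spirit of Marzocchi and Taroni [2014]: the declustered rates are inflated by the ratio of full-catalog to declustered event counts in each magnitude bin, restoring the pre-declustering magnitude-frequency distribution. The function name and all counts and rates are hypothetical illustrations, not MPS19 values:

```python
import numpy as np

def cluster_corrected_rates(declustered_rates, n_full, n_declustered):
    """Inflate declustered annual rates per magnitude bin so that the
    magnitude-frequency distribution of the full (non-declustered)
    catalog is restored in a given spatial bin.

    declustered_rates: annual rates per magnitude bin (declustered model)
    n_full, n_declustered: event counts per magnitude bin
                           before/after declustering
    """
    n_full = np.asarray(n_full, dtype=float)
    n_decl = np.asarray(n_declustered, dtype=float)
    # hypothetical fallback: leave bins with no declustered events unchanged
    factor = np.where(n_decl > 0, n_full / n_decl, 1.0)
    return np.asarray(declustered_rates, dtype=float) * factor

# hypothetical magnitude bins M4.5, 5.0, 5.5:
# clustering removes proportionally more events at low magnitudes
rates = cluster_corrected_rates([0.20, 0.08, 0.03],   # declustered rates
                                [120, 30, 8],         # full-catalog counts
                                [60, 20, 8])          # declustered counts
```

Because the ratio shrinks toward 1 at larger magnitudes, the correction mostly raises the rates of moderate events, which is consistent with the larger hazard increases found at 10% PoE than at 2% PoE.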
Similar work has also been carried out using the procedure described by Iervolino et al. [2012]; the outcomes will be described in a separate paper [Chioccarelli et al., 2021]. (Figure 11 caption: results for the PGA maps of Figure 7b and 7c, respectively, according to the procedure described by Marzocchi and Taroni [2014]; the maps show only the sites with a PGA above the 25th percentile.)

Final remarks, highlights, and open questions
MPS19 makes the best use of all newly available data and implements the most advanced practice, introducing several novelties that are worth remarking: • MPS19 honors the hazard/risk separation principle, i.e., it does not include any subjective "precautionary" choices, which should be introduced only at the risk level. In other words, all choices made in MPS19 are based on scientific considerations, independently of their impact on the seismic hazard.
• MPS19 reports both the aleatory variability and the epistemic uncertainty in the final output, making the interpretation of the dispersion of the hazard curves clearer and accessible to any possible end-user.
• MPS19 has been checked through an extensive testing phase, used to assign weights to each component of the model and to verify the reliability of the ERMs, the GMMs, and the hazard outcomes in terms of PGA and macroseismic intensity, both at several sites and, on average, over the whole territory.
• We introduce MPS19_cluster, which is a pioneering attempt to correct the classical PSHA model to account for the large foreshocks and aftershocks that have been removed by the declustering technique.
At the same time, the work carried out for MPS19 has also highlighted some challenges that should be addressed in the future. Some of them are general, others are specific to the Italian territory: • MPS19_cluster shows the impact of removing earthquake clusters in terms of seismic hazard. Other methods based on synthetic catalogs may offer additional interesting new perspectives in this field [e.g., UCERF3; Field et al., 2017].
• One of the basic assumptions in PSHA is that seismicity is stationary in time once the short-term clustering has been removed by declustering. In other words, time dependency on longer time scales is not included in this kind of modeling. This assumption is often taken for granted and not carefully tested. We think that in the future more efforts will be necessary to verify this assumption.
• The magnitude-frequency distribution has a large impact on PSHA. The common assumption of using a Gutenberg-Richter-like distribution over wide areas may be questionable, and currently we do not have enough data to test this assumption, at least for what concerns long-term PSHA purposes.
• GMMs are still mostly empirical and very likely not able to capture the details of IM variability at the spatial scale of interest for PSHA. Moving from general GMMs to more specific models which account for the regional geology may lead to significant variations of PSHA.
As a final remark, we argue that the procedures adopted in MPS19 make the comparison with the previous model MPS04 challenging. In particular, the elements of substantial difference are: i) MPS04 represents the median of the branches, MPS19 the average; ii) MPS04 represents the maximum horizontal component of the shaking, whereas MPS19 the geometric mean; iii) MPS04 made extensive use of expert judgment, essentially aimed at introducing precautionary factors, whereas MPS19 keeps this judgment to a minimum, avoiding the use of precautionary components; iv) the basic data are completely different and up to date (more than 15 years have passed); v) MPS19 is based on a fully revised seismic catalog, with magnitudes estimated with more advanced procedures; vi) some earthquakes in MPS04 (based on the CPTI04 catalog) were aftershocks that have been removed in MPS19 by the declustering method; vii) MPS04 and MPS19 use completely different GMMs; viii) the accelerometric databases used to calibrate the GMMs show marked differences in the soil classification of the seismic stations.
Owing to these reasons, a detailed appraisal of the differences will be addressed in a separate future paper.