This map presents the percent likelihood of a magnitude 5.5 or greater event in any 7-day period. These predictions were calculated from USGS data covering the 30-year period between 1989 and 2019. Read below to learn more.

*Authored by Chris Luedtke*

Some seismologists claim earthquake prediction is an inherently impossible task. While respecting this perspective, we maintain a healthy optimism that technical advancements will make this critical problem tractable. In developing the model above, we reviewed recent academic publications that implement new machine learning techniques with some success. We link to those articles and provide notes on GitHub. Below we present our process for developing our own baseline predictive model.

We begin our analysis with data made available by the USGS, which was the most accessible and complete catalog we found. The USGS is transparent that this catalog is not complete and should not be treated as such. Incomplete data are a universal challenge in earthquake modeling and in data science more broadly.

In this case, missing data can be categorized in two ways: 1) missing due to recency and 2) missing due to limited detection capability.

To assess missing data due to recency, we consider magnitude 4 or greater earthquake events occurring in 2019 to date.

In **Figure 1** we see a dramatic decrease in magnitude 4-5 events in the last month. We presume this does not represent actual earthquake activity, but rather suggests missing data.

The USGS catalog is constantly evolving as earthquake events are created or modified retroactively. This is demonstrated in **Figure 2**, which shows that the majority of events are updated more than 50 days after they occur.

Missing data due to recency make real-time predictive models difficult to deploy. If we were to update a predictive model on current events, our model would get the mistaken impression that fewer earthquakes had occurred recently than have actually taken place. We could only base predictions on events which the USGS can reliably publish on a timely basis for each region on Earth.

We did not employ a rigorous solution to this category of missing data. We limited our analysis to events prior to 2019 in hopes that the majority of earthquake events had already been logged.

Earthquake sensors across the world are incapable of detecting even a majority of earthquake events. To demonstrate, we consider all magnitude 2 or greater earthquakes between 1989 and 2019.

The Gutenberg–Richter law provides a starting point to assess these missing data:

The Gutenberg–Richter law expresses the relationship between the magnitude and total number of earthquakes in any given region and time period of *at least* that magnitude.

$$\log_{10} N = a - bM$$

Seismologists use this law to identify the magnitude threshold, M_{c}, above which an earthquake catalog is deemed to be complete. We did the same for the USGS catalog by iteratively increasing M_{c}, computing a linear regression, and keeping the M_{c} for which the linear regression best fit our observed data. **Figure 3** displays our results.
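The threshold search described above can be sketched as follows. This is a minimal illustration rather than our production code: the function names `fit_line` and `estimate_mc`, the 0.1 magnitude bin width, the minimum number of points per fit, and the use of R² as the goodness-of-fit score are all assumptions made for the example.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared

def estimate_mc(magnitudes, bin_width=0.1, min_points=5):
    """Try each candidate M_c, fit log10 N = a - b*M to the thresholds at or
    above it, and keep the candidate with the best fit (highest R^2, assumed).
    Returns (m_c, a, b, r_squared), or None if there are too few thresholds."""
    lowest = math.floor(min(magnitudes) / bin_width) * bin_width
    n_bins = int((max(magnitudes) - lowest) / bin_width + 1e-9) + 1
    points = []
    for k in range(n_bins):
        threshold = lowest + k * bin_width
        # N(M): number of events with magnitude >= M (small float tolerance).
        count = sum(1 for m in magnitudes if m >= threshold - 1e-9)
        if count > 0:
            points.append((threshold, math.log10(count)))
    best = None
    for i in range(len(points) - min_points + 1):
        xs = [p[0] for p in points[i:]]
        ys = [p[1] for p in points[i:]]
        slope, intercept, r2 = fit_line(xs, ys)
        # Ties go to the lower threshold so that more data are retained.
        if best is None or r2 > best[3] + 1e-12:
            best = (xs[0], intercept, -slope, r2)
    return best
```

A call such as `estimate_mc(catalog_magnitudes)` returns the candidate threshold together with the fitted `a` and `b`. Other fit criteria are possible and may select a different threshold.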

These missing data are handled by simply dropping all records below the identified M_{c}.

For each previously mentioned data quality concern, one must also consider regional characteristics. For example, we expect that the USGS both detects more earthquakes in the United States and also publishes those events more rapidly.

There are a number of approaches to this dimension of our analysis. Ideally, we would access a database containing precise coordinate locations and operating periods of all seismic sensor stations on Earth. Alternatively, we would find existing seismic region definitions based on observed seismic properties.

In our approach, we place 4,000 approximately equidistant nodes across the Earth and assess the data nearest to each node. See **Figure 4**.
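One common way to place approximately equidistant points on a sphere is a Fibonacci lattice. The sketch below uses that technique purely as an illustration; it is an assumption, not necessarily the placement method behind **Figure 4**, and the function name `fibonacci_sphere_nodes` is hypothetical.

```python
import math

def fibonacci_sphere_nodes(n_nodes=4000):
    """Return n_nodes (lat, lon) pairs in degrees, roughly evenly spaced
    over a sphere, using a Fibonacci (golden-angle) lattice."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    nodes = []
    for i in range(n_nodes):
        # Evenly spaced z values give bands of equal surface area.
        z = 1.0 - (2.0 * i + 1.0) / n_nodes
        lat = math.degrees(math.asin(z))
        # Each node is rotated by the golden angle relative to the last.
        lon = math.degrees((i * golden_angle) % (2.0 * math.pi)) - 180.0
        nodes.append((lat, lon))
    return nodes

nodes = fibonacci_sphere_nodes(4000)
```

The lattice avoids the crowding near the poles that a regular latitude/longitude grid would produce, which matters when each node is meant to represent a similar area.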

Since we ultimately care about predicting large earthquakes, we first cluster only magnitude 5.5 or greater events to their nearest node. Limiting to these nodes greatly reduces computation cost, as we needed only 22% of our 4,000 node locations to capture all 5.5+ magnitude events. **Figure 5** displays these nodes.

With the important nodes identified, we clustered all earthquake events to these filtered nodes. We applied a distance filter consistent with the maximum distance that any 5.5+ magnitude event occurred from its nearest node (about 250 kilometers).
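The nearest-node assignment with a distance cutoff might look like the sketch below. It assumes great-circle (haversine) distance on a spherical Earth with a mean radius of 6,371 km; the function names and the brute-force search are illustrative choices, not a description of our actual implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical Earth)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

def assign_to_nodes(events, nodes, max_km=250.0):
    """events and nodes are lists of (lat, lon) in degrees.
    Returns one (node_index, distance_km) pair per event; node_index is
    None when the nearest node is farther away than max_km."""
    assignments = []
    for lat, lon in events:
        distances = [haversine_km(lat, lon, nlat, nlon) for nlat, nlon in nodes]
        nearest = min(range(len(nodes)), key=distances.__getitem__)
        d = distances[nearest]
        assignments.append((nearest if d <= max_km else None, d))
    return assignments
```

This brute-force loop compares every event with every node, which is fine for a small example. For a full catalog, a spatial index such as scikit-learn's `BallTree` with the haversine metric would likely be faster.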

To get our baseline predictive model, for each cluster we counted the number of weeks in which a 5.5+ magnitude event occurred. We then divided this by the total number of weeks in our dataset. This gives a percent likelihood of a 5.5+ magnitude earthquake for each node. For example, 30 years is roughly 1,565 weeks, so a cluster in which 15 weeks contained magnitude 5.5+ earthquakes corresponds to roughly a 1% chance of a 5.5+ magnitude earthquake in any given week.
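The calculation for a single cluster can be sketched as below. The function name `weekly_likelihood`, the exact start and end dates, and the choice to count non-overlapping 7-day windows from the start date are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def weekly_likelihood(event_times, start, end):
    """Fraction of whole 7-day windows in [start, end) that contain at
    least one event. event_times holds the datetimes of the 5.5+ magnitude
    events assigned to one cluster."""
    total_weeks = (end - start).days // 7
    if total_weeks <= 0:
        raise ValueError("the period must span at least one full week")
    weeks_with_event = set()
    for t in event_times:
        if start <= t < end:
            week = (t - start).days // 7
            if week < total_weeks:  # ignore the trailing partial week
                weeks_with_event.add(week)
    return len(weeks_with_event) / total_weeks

# Hypothetical cluster: 15 distinct weeks with an event, plus a second
# event in the first week, which must not be counted twice.
start, end = datetime(1989, 1, 1), datetime(2019, 1, 1)
times = [start + timedelta(weeks=10 * k, days=1) for k in range(15)]
times.append(start + timedelta(days=2))
likelihood = weekly_likelihood(times, start, end)  # 15 / 1565, just under 1%
```

Counting weeks rather than events keeps the estimate bounded between 0 and 1, since a week with several large events still counts once.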

This baseline relies on the assumption that the USGS has accurately cataloged all 5.5+ magnitude earthquakes since 1989.

Finally, we re-added the nodes in which no 5.5+ magnitude earthquakes have occurred and assigned 0% likelihood to those nodes. This gave us a complete grid of nodes over the Earth along with their earthquake likelihood values. We used these values to produce the final contour map presented above.