Earthquake Intensity Comparison of MMI and CDI

Comparison of Maximum CDI to Maximum MMI

Intensity, the expression or damage level of an earthquake at the surface, is traditionally reported as a Modified Mercalli Intensity (MMI) rating from 0 to 12. In the last 20 years, the USGS established a new method that uses crowdsourced, internet-reported effects and damage estimates.

This project first compared the crowdsourced intensity values, known as Community Decimal Intensity (CDI), with MMI. It found that CDI typically rates an earthquake's intensity slightly lower than MMI, but that with the inclusion of additional factors such as region and depth, and using the number of CDI reports per quake as a quality measure, CDI can reasonably be used in place of MMI. This is fortunate, as CDI is cheap and easy to collect, and therefore CDI estimates exist for 3-4 times as many earthquakes as have professionally assigned MMI measurements.

# Import the usual libraries for data organization, mathematics, and plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Read in the cleaned earthquake data
# This data was cleaned, and exploratory data analysis done, in the notebook Capstone-EDA.ipynb

df = pd.read_csv('./datasets/clean_quakes_v2.csv')
df.shape
(114498, 32)
df.head(2)
id long lat depth alert cdi code detail dmin felt ... time title tsunami type types tz updated url magDecade year
0 nc1022389 -121.8735 36.593 4.946 0.0 NaN 1022389 https://earthquake.usgs.gov/fdsnws/event/1/que... 0.03694 NaN ... 1974-12-30 13:28:16.830 M 3.4 - Central California False earthquake focal-mechanism,nearby-cities,origin,phase-data NaN 2016-12-14 18:02:44.940 https://earthquake.usgs.gov/earthquakes/eventp... 3 1974
1 nc1022388 -121.4645 36.929 3.946 0.0 NaN 1022388 https://earthquake.usgs.gov/fdsnws/event/1/que... 0.04144 NaN ... 1974-12-30 09:46:54.820 M 3.0 - Central California False earthquake nearby-cities,origin,phase-data NaN 2016-12-14 18:02:33.974 https://earthquake.usgs.gov/earthquakes/eventp... 2 1974

2 rows × 32 columns

Extract the factors needed for the model


dfm = df[df['mmi'].notnull()][['id','lat', 'long', 'depth', 'mag', 'mmi', 'cdi', 'alert', \
                              'felt', 'time', 'net', 'type',  'gap', 'magDecade', 'year']]
dfm.shape
(2955, 15)
dfm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2955 entries, 1148 to 114485
Data columns (total 15 columns):
id           2955 non-null object
lat          2955 non-null float64
long         2955 non-null float64
depth        2955 non-null float64
mag          2955 non-null float64
mmi          2955 non-null float64
cdi          2107 non-null float64
alert        2955 non-null float64
felt         2107 non-null float64
time         2955 non-null object
net          2955 non-null object
type         2955 non-null object
gap          2744 non-null float64
magDecade    2955 non-null int64
year         2955 non-null int64
dtypes: float64(9), int64(2), object(4)
memory usage: 369.4+ KB
# We must have the CDI data, so drop records where it is missing.
# A few records are missing values in other fields as well;
# this drops all rows with any missing value

dfm = dfm.dropna()
dfm.shape
(2009, 15)
# CDI values created from fewer than 5 crowdsourced reports are less reliable,
# so those are filtered out in the plots and models below; 1606 records would remain

dfm[dfm['felt'] >= 5].shape
(1606, 15)

Examine the distributions of the magnitude and intensity

This highlights one of the difficulties with this dataset: we have many more earthquakes at lower magnitude and intensity than at higher magnitude. While this is good for those living in earthquake zones, it makes modeling harder, as the abundant data at lower values disproportionately affects the regression line fit.

To handle this, I employed a weighted bootstrap to resample the data with emphasis on retaining many more of the higher-magnitude quakes. A dataset of the same size was built by sampling the original (with replacement) and keeping each sample if a randomly generated number was at least as large as a clip threshold for its magnitude bin. A sample was retained 25% of the time for magnitudes of 3 or less, only 5% of the time for magnitudes between 3 and 4, and 20% of the time for magnitudes between 4 and 5. All samples with magnitude above 6 were kept, as were 90% of samples with magnitude between 5 and 6 (the keep probabilities are summarized in the sketch below). As seen below, this produced a dataset with a more uniform distribution of magnitude, a measure of the energy released during an earthquake.
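For reference, here are the keep probabilities implied by the clip thresholds in the loop below (not a separate computation; a sample is kept when random() >= weight, so P(keep) = 1 - weight):

# Keep probability per magnitude bin implied by the clip thresholds below
keep_prob = {
    'mag <= 3':     1 - 0.75,   # 25%
    '3 < mag <= 4': 1 - 0.95,   # 5%
    '4 < mag <= 5': 1 - 0.80,   # 20%
    '5 < mag <= 6': 1 - 0.10,   # 90%
    'mag > 6':      1 - 0.00,   # 100%
}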

sns.set(font_scale=1.5)
fig, axs = plt.subplots(ncols=3, figsize=(18,6))
fig.suptitle("Distributions for Earthquakes with CDI and MMI data", fontsize=22)

dfm['mag'].plot(kind='hist', bins=7, range=(0,7), ax=axs[0])
dfm['mmi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[1])
dfm['cdi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[2])
axs[0].set_xlabel('Magnitude', fontsize=18)
axs[1].set_xlabel('Modified Mercalli Intensity', fontsize=20)
axs[2].set_xlabel('Community Decimal Intensity', fontsize=20)
axs[0].locator_params(numticks=9)

[figure: histograms of magnitude, MMI, and CDI for earthquakes with both intensity measures]

# Weighted bootstrap: cycle through the data repeatedly, keeping each sample with a
# magnitude-dependent probability, to obtain samples more equally distributed across magnitude

from random import random

dfmw = pd.DataFrame()

idx=0
while (len(dfmw) < len(dfm)) and (idx<=20000):
           
    prn = random()
    sample = dfm.iloc[idx]
   
    if sample['mag'].item() <= 3:
        weight = 0.75
    elif sample['mag'].item() <= 4:
        weight = 0.95
    elif sample['mag'].item() <= 5:
        weight = 0.8
    elif sample['mag'].item() <= 6:
        weight = 0.1
    else:
        weight = 0
        
    if prn >= weight:
        dfmw = dfmw.append(sample, ignore_index=True)
        
    idx += 1    
    if idx == len(dfm):
        idx = 0
    
dfmw.shape



(2009, 15)
# Plot distribution of magnitude, Modified Mercalli Intensity and Community Decimal Intensity

sns.set(font_scale=1.5)
fig, axs = plt.subplots(ncols=3, figsize=(18,6))
fig.suptitle("Distributions after weighted bootstrap", fontsize=22)

dfmw['mag'].plot(kind='hist', bins=7, range=(0,7), ax=axs[0])
dfmw['mmi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[1])
dfmw['cdi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[2])
axs[0].set_xlabel('Magnitude', fontsize=18)
axs[1].set_xlabel('Modified Mercalli Intensity', fontsize=20)
axs[2].set_xlabel('Community Decimal Intensity', fontsize=20)
axs[0].locator_params(numticks=9)

[figure: magnitude, MMI, and CDI histograms after the weighted bootstrap]

# Further restrict the dataset to those samples where 5 or more crowdsourced reports were used to calculate the CDI.

sns.set(font_scale=1.5)
fig, axs = plt.subplots(ncols=3, figsize=(18,6))
fig.suptitle("Distributions after weighted bootstrap: DYFI responses >= 5", fontsize=22)

dfmw[dfmw['felt'] >= 5]['mag'].plot(kind='hist', bins=7, range=(0,7), ax=axs[0])
dfmw[dfmw['felt'] >= 5]['mmi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[1])
dfmw[dfmw['felt'] >= 5]['cdi'].plot(kind='hist', bins=9, range=(0,9), ax=axs[2])
axs[0].set_xlabel('Magnitude', fontsize=18)
axs[1].set_xlabel('Modified Mercalli Intensity', fontsize=20)
axs[2].set_xlabel('Community Decimal Intensity', fontsize=20)
axs[0].locator_params(numticks=9)

[figure: histograms after the weighted bootstrap, restricted to quakes with 5 or more DYFI responses]

dfmw.head()
alert cdi depth felt gap id lat long mag magDecade mmi net time type year
0 0.0 8.9 8.950 1141.0 123.4 ci3347678 34.416000 -118.370000 6.6 6.0 8.6 ci 1971-02-09 09:00:41.920 earthquake 1971.0
1 0.0 8.3 15.000 75.0 168.0 ci3352060 32.667333 -115.359167 6.4 6.0 9.5 ci 1979-10-15 19:16:53.910 earthquake 1979.0
2 0.0 8.4 9.578 73.0 137.0 nc1091100 36.231667 -120.312000 6.7 6.0 8.3 nc 1983-05-02 19:42:38.060 earthquake 1983.0
3 0.0 6.8 17.214 2.0 89.0 nc216859 37.036167 -121.879833 6.9 6.0 8.6 nc 1989-10-17 20:04:15.190 earthquake 1989.0
4 0.0 8.6 8.881 859.0 44.4 ci731691 34.061000 -118.079000 5.9 5.0 7.5 ci 1987-10-01 10:42:20.020 earthquake 1987.0

Graphic view of correlation between MMI and CDI.

Three correlation plots are shown.

  • The first compares MMI and CDI for each magnitude bin: 2-3, 3-4, 4-5, etc.
  • The second contrasts samples with magnitude below 3 against those with magnitude of 6 or greater.
  • The third shows the best-fit regression line through all data, regardless of magnitude, along with the distribution of the two intensity measures for the sample set.

In all cases, the crowdsourced Community Decimal Intensity under-reports the intensity of larger-magnitude quakes as compared to the expertly determined Modified Mercalli Intensity.

def set_color(magbin):
    if magbin == 2:
        color = "0000FF"
    elif magbin == 3:
        color = "007FDD"
    elif magbin == 4:
        color = "008E7F"
    elif magbin == 5:
        color = "80E100"        
    elif magbin == 6:
        color = "FFFF00"
    elif magbin == 7:
        color = "FFA000"
    elif magbin == 8:
        color = "FF5A00"
    else: 
        color = "FF0000"  
    
    return color
dfmw['color'] = dfmw['magDecade'].map(lambda x: set_color(x))
# sns.color_palette(palette=['#007FDD', '#008E7F', '#80E100,' '#FFFF00', '#FFA000', '#FF5A00'], n_colors=6, desat=None)

quake_colors = ['#007FDD',  # Q2
                    '#008E7F',  # Q3
                    '#80E100',  # Q4
                    '#FFFF00',  # Q5
                    '#FFA000',  # Q6
                    '#FF5A00',  # Q7
                   ]

quake_colors2 = ['#007FDD',  # Q2 or Q3
                 '#FF5A00',  # Q6 or Q7
                   ]

dfmw_plot1 = dfmw[(dfmw['felt']>=5) & (dfmw['mmi'] != 0)][['mmi', 'cdi', 'felt', 'magDecade']].copy()
dfmw_plot1['          magSize'] = dfmw_plot1['magDecade'].map(lambda x: 0 if x <= 3 else 1 if x >= 6 else np.nan)
dfmw_plot1.columns = ['Modified Mercalli Intensity', 'Community Decimal Intensity', 'felt', 'magDecade', '          magSize' ]

# fig, ax = plt.subplots(figsize=(8,5))
sns.set(font_scale=2)
sns.lmplot('Modified Mercalli Intensity', 'Community Decimal Intensity', \
           data=dfmw_plot1[(dfmw_plot1['felt']>=5)], \
           hue='magDecade', palette=quake_colors, size=10, aspect=1)
<seaborn.axisgrid.FacetGrid at 0x11a1aa890>

[figure: CDI vs. MMI with per-magnitude-bin regression lines]


sns.set(font_scale=2)
sns.lmplot('Modified Mercalli Intensity', 'Community Decimal Intensity', \
           data=dfmw_plot1[(dfmw_plot1['felt']>=5) & (dfmw_plot1['          magSize'].notnull())], \
           hue='          magSize', palette=quake_colors2, size=10, aspect=1)
<seaborn.axisgrid.FacetGrid at 0x11b7a7d90>

[figure: CDI vs. MMI regression lines for low-magnitude (<= 3) vs. high-magnitude (>= 6) quakes]

sns.set(font_scale=2)
sns.jointplot('Modified Mercalli Intensity', 'Community Decimal Intensity', \
           data=dfmw_plot1[(dfmw_plot1['felt']>=5)], \
           kind = "reg", size=10)
<seaborn.axisgrid.JointGrid at 0x11b860790>

[figure: joint plot of CDI vs. MMI with overall regression line and marginal distributions]

Feature Engineering

New factors were created to look more deeply into these relationships.

  • Samples were binned into the Western, Central, and Eastern zones of the US.
  • Factors holding the rounded, integer-equivalent intensities were created, as intensity was traditionally reported as an integer between 0 and 12.
  • The difference between the intensities, both rounded integer and decimal, was added.

This was done both to see if there might be a closer correlation between the integer MMI and CDI values, and in preparation for the classification models to be run, which will predict the integer MMI value from the CDI and other parameters.

# Feature Engineering 
dfmw['mmi-cdi_difference'] = dfmw['mmi'] - dfmw['cdi']
dfmw['mmi_round'] = dfmw['mmi'].round(0)
dfmw['cdi_round'] = dfmw['cdi'].round(0)
dfmw['diff_rounded'] = dfmw['mmi_round'] - dfmw['cdi_round']
dfmw['EW'] = dfmw['long'].map(lambda x: 'East' if x > -90 else 'Mid' if x > -105 else 'West')

dfmw['cdi_round'] = dfmw['cdi_round'].map(lambda x: int(x))
dfmw['mmi_round'] = dfmw['mmi_round'].map(lambda x: int(x))
dfmw['diff_rounded'] = dfmw['mmi_round'] - dfmw['cdi_round']
dfmw[['mag', 'mmi', 'cdi', 'felt', 'mmi_round', 'cdi_round', 'diff_rounded', 'mmi-cdi_difference' ]].head()
mag mmi cdi felt mmi_round cdi_round diff_rounded mmi-cdi_difference
0 6.6 8.6 8.9 1141.0 9 9 0 -0.3
1 6.4 9.5 8.3 75.0 10 8 2 1.2
2 6.7 8.3 8.4 73.0 8 8 0 -0.1
3 6.9 8.6 6.8 2.0 9 7 2 1.8
4 5.9 7.5 8.6 859.0 8 9 -1 -1.1
Average Difference and Regional Effects

The mean difference between all MMI and CDI values is only 0.26, but the standard deviation is 1.34, indicating substantial variation in the individual differences.

There is slightly better MMI/CDI correlation in the Central and Eastern US. In the East, this might be due to denser crustal material allowing earthquake energy to travel longer distances, increasing the area where higher intensity might be felt. The Central result was surprising, given the significant expanse of lower-density sedimentary basin geology there.

Comparing the integer MMI to the integer CDI using all samples shows the Western pattern of CDI under-reporting intensity relative to MMI. This is unsurprising, as Western US earthquakes dominate the sample (a quick count follows below).
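As a quick check (a sketch, not in the original run), the zone counts make that Western dominance explicit:

# Count quakes per longitude zone created above
print(dfmw['EW'].value_counts())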


print dfmw['mmi-cdi_difference'].mean(), dfmw['mmi-cdi_difference'].std()
0.261632653061 1.34716813151
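The regional claim can also be checked numerically; a minimal sketch (not part of the original notebook), using the EW zone label before it is dummied away below:

# Pearson correlation of MMI vs CDI within each region
for zone in ['West', 'Mid', 'East']:
    sub = dfmw[(dfmw['EW'] == zone) & (dfmw['felt'] >= 5)]
    print("{}: n={}, r={:.2f}".format(zone, len(sub), sub['mmi'].corr(sub['cdi'])))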
# Dummy the country regional indicator to separate eastern, central and western quakes
dfmw = pd.get_dummies(dfmw, columns=['EW'], prefix='Z', prefix_sep='_')
dfmw = dfmw.drop('Z_Mid', 1)

# Plot correlation of mmi and cdi for quakes in Western, Central and Eastern US
fig, axs = plt.subplots(ncols=3, figsize=(18,6))


sns.regplot('mmi', 'cdi', data=dfmw[(dfmw['felt']>=5) & (dfmw['Z_West']==1)], ax=axs[0])

sns.regplot('mmi', 'cdi', data=dfmw[(dfmw['felt']>=5) & (dfmw['Z_West']==0) & (dfmw['Z_East']==0)],\
              ax=axs[1], color='darkblue')

sns.regplot('mmi', 'cdi', data=dfmw[(dfmw['felt']>=5) & (dfmw['Z_East']==1)], ax=axs[2])

axs[0].set_title('Western US')
axs[1].set_title('Central US')
axs[2].set_title('Eastern US')


<matplotlib.text.Text at 0x11a8738d0>

[figure: CDI vs. MMI regressions for Western, Central, and Eastern US quakes]

# Look at just the traditional integer intensity scales by rounding the decimal values.
# The Pearson correlation coefficient is not as good as when using the decimal-scale intensities above.

sns.jointplot('mmi_round', 'cdi_round', data=dfmw[dfmw['felt']>100], kind = "reg", size=8)
<seaborn.axisgrid.JointGrid at 0x11b38fcd0>

[figure: joint plot of rounded CDI vs. rounded MMI]

Build Classification Models

I built two classification models to predict MMI from crowdsourced CDI. If this works, the factors that influence the difference between the Modified Mercalli Intensity and the slightly lower Community Decimal Intensity values can be examined through the model coefficients.

The first model uses the following factors:

  • depth
  • number of CDI reports
  • zone (Western, Central, or Eastern US)

A second model includes the features from the first, and adds

  • Quality measures of the magnitude
    • magnitude calculation type
    • largest azimuthal gap between stations
    • number of reporting stations
    • primary reporting network
  • Quality of the CDI assessment
    • number of “Did You Feel It?” reports for each quake
    • time of day of the earthquake (this turned out not to provide any coherent influence)
  • CRUST1.0 data - a crustal model with the thickness and density of crustal layers, developed by researchers at Lawrence Livermore National Lab: https://igppweb.ucsd.edu/~gabi/crust1.html

The data dictionary for the USGS ComCat (Comprehensive Catalog) earthquake data is very helpful in understanding the factors used: https://earthquake.usgs.gov/data/comcat/data-eventterms.php

Model 1: Logistic Regression

Regularization: Ridge, Strength: C=1

This model uses default values for regularization, with no attempt to tune them. With these few factors and the assumed orthogonality between them, overfitting is not expected to be a problem, but a grid search for the best regularization parameters will run next, after the initial model.

# Import the correct libraries for Logistic Regression, either SciKit Learn or StatsModel

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# import statsmodels.api as sm
y = dfmw['mmi_round'].values
X = dfmw[['cdi', 'depth', 'felt', 'Z_East', 'Z_West']]
# MMI Baseline
dfmw['mmi_round'].value_counts(dropna=False)
4     601
5     324
6     306
3     287
7     174
2      93
8      84
0      72
9      48
1      12
10      8
Name: mmi_round, dtype: int64
# MMI Baseline continued
# 4 is the majority class
baseline = 601.0 / len(dfmw['mmi_round'])
baseline
0.29915380786460927
# Train/Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
# Standardize

# Standardize the data
ss = StandardScaler()
Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
# Logistic Regression instantiation.  Note that penalty and solver are set to the defaults.

# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for details

lr = LogisticRegression(penalty="l2", C=1, fit_intercept=True, solver="liblinear")

# Next step is to fit the model
lr.fit(Xs_train, y_train)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# Display the model results and coefficients

# The score, intercept and coefficients 
print "Model Score:", lr.score(Xs_train, y_train)
print "Intercept:", lr.intercept_
print "Coefficients:", lr.coef_

# Also the classes
print "\nPrediction Classes:", lr.classes_

Model Score: 0.408961593172
Intercept: [-3.86133794 -4.91173358 -3.91708595 -2.73073976 -1.20072544 -1.7485775
 -1.95038224 -2.88280747 -5.19729505 -5.63174751 -5.5148582 ]
Coefficients: [[-0.76362037  0.17691095 -0.46196057  0.43072877  0.82841482]
 [-0.68832482  0.24066698 -0.01651672  0.04113787  0.3086479 ]
 [-0.88499784  0.47327597 -0.99748266  0.04503197  0.8913371 ]
 [-1.24884694  0.48956827 -1.6316105  -0.49774772 -0.06869755]
 [-0.44766727 -0.13383711 -1.35610309 -0.22481553 -0.72974785]
 [ 0.23155552 -0.47834206 -0.53117561  0.30258659  0.23398375]
 [ 0.46630384 -0.31665001  0.03611068  0.34075793  0.60171148]
 [ 0.92997209 -0.5696308   0.30887668  0.12386617  0.40835953]
 [ 2.01172045  0.3957158   0.19392966 -0.49752179  0.87401229]
 [ 1.82411242 -0.16394402  0.2103106  -0.4700785   0.68128807]
 [ 1.19159181  0.26658051 -0.82296951 -0.03100784  0.13810488]]

Prediction Classes: [ 0  1  2  3  4  5  6  7  8  9 10]
# The default model accuracy  
print lr.score(Xs_test, y_test)
0.394693200663
# Get predicted values to plot against actuals
predicted = lr.predict(Xs_test)

# Save residuals for future modeling
lr_residuals = y_test - predicted
logreg_test = pd.DataFrame({'y_test' : y_test, 'predicted' : predicted, 'residuals' : lr_residuals})
logreg_test = logreg_test[['y_test', 'predicted', 'residuals']]
# logreg_test = logreg_test[['y', 'predicted']]
sns.jointplot(x='y_test', y='predicted', data=logreg_test, kind="reg")
<seaborn.axisgrid.JointGrid at 0x11cc31f10>

[figure: predicted vs. actual MMI for the baseline logistic regression test set]

logreg_test.head()
y_test predicted residuals
0 3 4 -1
1 8 8 0
2 4 4 0
3 5 4 1
4 8 7 1
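Accuracy alone hides the per-class behavior; a small sketch (not in the original run) that breaks the same test predictions down by class:

from sklearn.metrics import classification_report

# Precision and recall per MMI class for the baseline logistic regression
print(classification_report(y_test, predicted))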

Model 1: Logistic Regression with GridSearch

Determine best regularization strength and method

While it is plausible that all factors in model one are reasonably orthogonal, and that overfitting with so few factors is unlikely, I ran a GridSearch just to make sure. It turns out a lasso (L1) penalty with a weaker regularization strength (C ≈ 59) slightly improves the model.

from sklearn.model_selection import GridSearchCV

# Gridsearch for best C and penalty
gs_params = {
    'penalty':['l1', 'l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,5,100)
}

# Define the GridSearch with CrossValidation and fit the model
lr_gridsearch_m1 = GridSearchCV(LogisticRegression(), gs_params, cv=3, verbose=1, n_jobs=-1)
lr_gridsearch_m1.fit(Xs_train, y_train)
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done 268 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:   19.1s finished





GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.26186e-05, ...,   7.92483e+04,   1.00000e+05]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
# best score on the training data:
print "Best Score:", lr_gridsearch_m1.best_score_

# best parameters:
print "Best Parameters:", lr_gridsearch_m1.best_params_
Best Score: 0.421763869132
Best Parameters: {'penalty': 'l1', 'C': 58.57020818056661, 'solver': 'liblinear'}
# assign the best estimator to a variable:
best_lr_m1 = lr_gridsearch_m1.best_estimator_
# Score it on the testing data:
best_lr_m1.score(Xs_test, y_test)
0.40630182421227196

Number of Samples and Prediction Stability of each Intensity Level

The regression plots above showed varying quality of fit for each magnitude level. Let's see if that is reflected in the prediction model. Remember, the goal here is to see if CDI (crowdsourced) can stand in for MMI (expert-assessed) intensity. In this model we are comparing only the maximum CDI to the maximum MMI for each earthquake. Thus, we'd like to see a plot with an intercept at 0 and a 45-degree line, where MMI = CDI at each intensity. We are predicting MMI, so the class in the table below is the MMI prediction.

What we see instead is intercepts below 0 in all cases, but much closer to 0 for intensities with more samples. Of even more interest is the relationship of CDI to MMI (class). The model shows that a larger and larger CDI coefficient is required at each increasing level of intensity in order to match the corresponding MMI; i.e., the CDI-reported maximum intensity numbers are generally lower than MMI numbers from 4 upward, and higher than MMI numbers from 3 downward.

Evidence from the initial correlation scatter plots, from the distribution graphs, and now from the logistic regression model all seems to indicate that maximum CDI intensity is reported in a narrower range than maximum MMI intensity.

# Examine the relationship of number of samples to intercept and coefficients

# Transpose the coefficient matrix so each row holds one feature's
# coefficients across the 11 intensity classes
coef_T = best_lr_m1.coef_.T
        
        
m1_result = pd.DataFrame({'class': best_lr_m1.classes_, 'intercept': best_lr_m1.intercept_, \
              'cdi': coef_T[0], \
              'depth': coef_T[1], \
              'felt': coef_T[2], \
              'Z_East': coef_T[3], \
              'Z_West': coef_T[4], \
              'num_samples': pd.Series(y_train).value_counts().sort_index()})

m1_result = m1_result[['class', 'num_samples', 'intercept', 'cdi', 'depth', 'felt', 'Z_East', 'Z_West']]
m1_result
class num_samples intercept cdi depth felt Z_East Z_West
0 0 46 -4.966626 -0.794547 0.188782 -1.677828 1.235309 2.536336
1 1 9 -20.061206 -0.789206 0.354134 -46.191729 0.000000 1.969268
2 2 66 -29.119896 -0.347485 0.552798 -82.894393 0.000000 2.624119
3 3 210 -5.270519 -1.080204 0.499090 -10.230018 -1.125777 -0.030238
4 4 424 -1.270225 -0.432568 -0.132865 -1.632007 -0.220743 -0.731095
5 5 230 -1.764310 0.236552 -0.488268 -0.553843 0.308218 0.239212
6 6 204 -1.970695 0.472895 -0.324443 0.035069 0.351818 0.623618
7 7 123 -2.953208 0.964297 -0.609777 0.312642 0.134098 0.433104
8 8 60 -7.151520 2.504545 0.472238 0.275177 -0.498556 2.926376
9 9 28 -9.669955 3.199057 -0.193623 0.414633 -0.694726 2.806037
10 10 6 -93.239788 25.281676 8.512454 -100.564489 0.000000 0.000000
# Get predicted values to plot against actuals
gs_predicted_m1 = best_lr_m1.predict(Xs_test)

gs_logreg_test_m1 = pd.DataFrame({'y_test' : y_test, 'predicted' : gs_predicted_m1})
gs_logreg_test_m1 = gs_logreg_test_m1[['y_test', 'predicted']]
sns.jointplot(x='y_test', y='predicted', data=gs_logreg_test_m1, kind="reg")
<seaborn.axisgrid.JointGrid at 0x11f1c2910>

[figure: predicted vs. actual MMI for the gridsearched logistic regression test set]

Run group of classifiers to find best performing model

The next step is to determine whether other classifiers can produce a better model than logistic regression. I used a method described by Jason Brownlee, https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/, to compare classifiers.

In this case, the Random Forest classifier produced the highest score on the test data as well as on the cross-validation of the training data.

# Run lots of classifiers on this and see which perform the best
# Import all the modeling libraries

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, \
                                    KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))  # requires non-negative features, so it fails on standardized inputs
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

# print "\n{}:   {:0.3} ".format('Baseline', baseline, cv_results.std())
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, Xs_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(Xs_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, Xs_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(Xs_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(Xs_test)))
    except:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison

# Set colors, as this shows up better on the grey background
# need to figure out how to set these colors for matplotlib boxplot
# from this color=color param that works for pandas boxplots
color = dict(boxes='DarkGreen', whiskers='DarkOrange', \
                medians='DarkBlue', caps='Gray')

print
sns.set(font_scale=1.5)
fig = plt.figure(figsize=(12,8))
fig.suptitle('Algorithm Comparison using Cross Validation')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()
Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        5         0.409               0.406 +/- 0.005         0.395 
LDA       5         0.457               0.412 +/- 0.006         0.416 
QDA       5         0.004               0.007 +/- 0.001         0.003 
KNN       5         0.776               0.570 +/- 0.020         0.650 
CART      5         0.999               0.745 +/- 0.022         0.778 
RFST      5         0.988               0.747 +/- 0.027         0.784 
GB        5         0.903               0.733 +/- 0.025         0.779 
ADA       5         0.410               0.405 +/- 0.022         0.362 
SVM       5         0.505               0.494 +/- 0.015         0.436 
GNB       5         0.155               0.149 +/- 0.005         0.134 
    MNB  :   failed on this input dataset 
BNB       5         0.351               0.336 +/- 0.012         0.322 

[figure: boxplot of cross-validation accuracy for each classifier, with the baseline as a dashed line]

Decision Tree Graph

I ran a decision tree (a DecisionTreeRegressor capped at depth 3) simply to create a graph that helps explain how decision trees work. This one is useful because it shows how the various factors are used in different branches to maximize node purity.

### Run a Decision Tree to produce an example visualization 
from sklearn.tree import DecisionTreeRegressor

dtrN = DecisionTreeRegressor(max_depth=3)
dtrN.fit(X_train, y_train)
dtrN_scores = cross_val_score(dtrN, X_train, y_train, cv=4)

print dtrN_scores, np.mean(dtrN_scores)
[ 0.61398968  0.59913131  0.60118508  0.63269497] 0.611750258258
# feature_names = X.columns

# TEMPLATE CODE
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

# initialize the output file object
dot_data = StringIO() 

# the fit DecisionTreeRegressor object here is dtrN
# for feature_names, use the columns of the X matrix
export_graphviz(dtrN, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                feature_names=X.columns)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png()) 

[figure: rendered decision tree graph]


Random Forest Classifier

The Random Forest is run again, this time to save the model and use it to predict the test set. This is necessary because the individual models are not saved in the classifier loop above. Random Forest classification is non-deterministic unless a random state is fixed, which was not done here. This run produced an even higher accuracy score of 0.803 on the test data.

The plot shows the actual MMI intensities against the predicted MMI intensity for each quake in the test data set. The Pearson correlation is 0.93, but there is still some spread in mid-range intensity values, i.e. the prediction works better for higher-intensity quakes.

# Re-run the Random Forest model and use it to predict MMI
# Save residuals for a future model using additional factors

print "Random Forest Classifier using factors:", list(X_train.columns)

rf = RandomForestClassifier()
rf.fit(Xs_train,y_train)

# Note: cv_results still holds the scores of the last model in the loop above (BNB),
# so the CrossVal column printed here is stale for the forest
print "{:14.14}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                ('Random Forest', Xs_train.shape[1], metrics.accuracy_score(y_train, rf.predict(Xs_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, rf.predict(Xs_test)))
        
Random Forest Classifier using factors: ['cdi', 'depth', 'felt', 'Z_East', 'Z_West']
Random Forest      5         0.989               0.336 +/- 0.012         0.803 
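As a side check (a sketch, not part of the original run), the fitted forest can report which factors it leaned on most:

# Relative feature importances of the fitted random forest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))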
rf_test_predicted = rf.predict(Xs_test)
rf_train_predicted = rf.predict(Xs_train)
rf_test_residuals = y_test - rf_test_predicted
rf_train_residuals = y_train - rf_train_predicted

rf_test = pd.DataFrame({'y_test':y_test, 'rf_test_predicted':rf_test_predicted, 'rf_test_residuals':rf_test_residuals})
rf_train = pd.DataFrame({'y_train':y_train, 'rf_train_predicted':rf_train_predicted, \
                        'rf_train_residuals':rf_train_residuals})

rf_test = rf_test[['y_test', 'rf_test_predicted', 'rf_test_residuals']]
rf_train = rf_train[['y_train', 'rf_train_predicted', 'rf_train_residuals']]

sns.set(font_scale=1.5)
sns.jointplot(x='y_test', y='rf_test_predicted', data=rf_test, kind="reg")
<seaborn.axisgrid.JointGrid at 0x12cf78fd0>

[figure: actual vs. predicted MMI for the random forest test set]

sns.set(font_scale=2)
rf_test_rename = rf_test.copy()
rf_test_rename.columns = ['Actual', 'Predicted', 'Residual']
sns.lmplot(x='Actual', y='Predicted', data=rf_test_rename, size=10, aspect=1)
<seaborn.axisgrid.FacetGrid at 0x12cf7fc10>

[figure: actual vs. predicted MMI for the random forest, with relabeled axes]

Gridsearch of Random Forest model to find best parameters

A grid search was run to determine the best parameters for:

  • number of trees
  • type of purity calculation
  • max depth of trees

It found that the gini impurity with a max depth of 12 and 30 trees was the best in the search, but the accuracy score on the test set was slightly lower. The grid search did not, in this case, result in a better model, and plots of actual vs. predicted are substantially the same as for the non-gridsearched model above.

# Gridsearch on Random Forest to find best depth
from sklearn.model_selection import GridSearchCV

# Gridsearch for best number of trees, split criterion, and tree depth
rfgs_params = {
    'n_estimators':[10,30,50],
    'criterion':['gini', 'entropy'],
    'max_depth':[4, 7, 12,None]
    }

# Define the GridSearch with CrossValidation and fit the model
rf_gridsearch = GridSearchCV(RandomForestClassifier(), rfgs_params, cv=3, verbose=1, n_jobs=-1)
rf_gridsearch.fit(Xs_train,y_train)
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:    2.6s finished





GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [10, 30, 50], 'criterion': ['gini', 'entropy'], 'max_depth': [4, 7, 12, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
# best score on the training data:
rf_gridsearch.best_score_
0.75248933143669983
# best parameters on the training data:
rf_gridsearch.best_params_
{'criterion': 'gini', 'max_depth': 12, 'n_estimators': 30}
# assign the best estimator to a variable:
best_rf = rf_gridsearch.best_estimator_
# Score it on the testing data:
best_rf.score(Xs_test, y_test)
0.78275290215588722
# predict on test (for results) and train (for residuals for a future model) data,
# using the best estimator from the grid search
gsrf_test_predicted = best_rf.predict(Xs_test)
gsrf_train_predicted = best_rf.predict(Xs_train)
gsrf_test_residuals = y_test - gsrf_test_predicted
gsrf_train_residuals = y_train - gsrf_train_predicted


gsrf_test_results = pd.DataFrame({'y_test':y_test, 'rf_predicted':gsrf_test_predicted, 'rf_residuals':gsrf_test_residuals})
gsrf_train_res = pd.DataFrame({'y_train':y_train, 'gsrf_train_predicted':gsrf_train_predicted, \
                        'rf_train_residuals':gsrf_train_residuals})

sns.jointplot(x='y_test', y='rf_predicted', data=gsrf_test_results, kind="reg")
<seaborn.axisgrid.JointGrid at 0x12d6081d0>

[figure: predicted vs. actual MMI for the gridsearched random forest test set]

sns.jointplot(x='y_train', y='gsrf_train_predicted', data=gsrf_train_res, kind="reg")
<seaborn.axisgrid.JointGrid at 0x12e8446d0>

[figure: predicted vs. actual MMI for the gridsearched random forest training set]

# Seems there isn't much residual on the training set left to fit! 
print gsrf_train_res['rf_train_residuals'].abs().sum()
print len(gsrf_train_res['rf_train_residuals'])
19
1406
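A sketch (not in the original notebook) showing how those few remaining training residuals are distributed:

# Distribution of the training residuals by value
print(gsrf_train_res['rf_train_residuals'].value_counts().sort_index())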

Include Additional Features

At this point I included additional features to attempt to produce a better model.

  • Crustal Features (depth to Moho, densities of crustal layers): The crustal features are from a model produced at Lawrence Livermore National Lab by Gabi Laske, Zhitu Ma, Guy Masters and Michael Pasyanos (https://igppweb.ucsd.edu/~gabi/crust1.html). Compressional and shear wave arrivals from worldwide seismic activity were used to determine the thickness and density of 3 crystalline and 3 sedimentary crustal layers at 1-degree intervals of latitude and longitude, i.e. at 64,800 (360 x 180) locations. The data from the location closest to each earthquake epicenter was added to that quake's record.

  • Reporting Network (net): The primary source, or reporting network, of the earthquake information.

  • Magnitude Calculation Type (type): The type of calculation used to determine magnitude, which depends on the distance of the earthquake from the recording station.

  • Largest Azimuthal Gap (gap): The largest angle between recording stations; larger angles reduce the accuracy of location and depth calculations.

  • Number of stations reporting P and S arrival times (nst): Number of stations used to determine the earthquake location.

  • Earthquake Time of Day: The hour of the day when the earthquake occurred. This was included to see if people reported intensity effects differently if awakened from sleep, or out and about at midday. While hour-of-day factors did survive regularization, no coherent pattern was seen; this model should be repeated without the hours included.

  • Number of DYFI reports (felt): Total number of crowdsourced intensity reports available for each quake. Higher numbers of reports will likely yield better accuracy when aggregated.

dfmw.head()
alert cdi depth felt gap id lat long mag magDecade ... time type year color mmi-cdi_difference mmi_round cdi_round diff_rounded Z_East Z_West
0 0.0 8.9 8.950 1141.0 123.4 ci3347678 34.416000 -118.370000 6.6 6.0 ... 1971-02-09 09:00:41.920 earthquake 1971.0 FFFF00 -0.3 9 9 0 0 1
1 0.0 8.3 15.000 75.0 168.0 ci3352060 32.667333 -115.359167 6.4 6.0 ... 1979-10-15 19:16:53.910 earthquake 1979.0 FFFF00 1.2 10 8 2 0 1
2 0.0 8.4 9.578 73.0 137.0 nc1091100 36.231667 -120.312000 6.7 6.0 ... 1983-05-02 19:42:38.060 earthquake 1983.0 FFFF00 -0.1 8 8 0 0 1
3 0.0 6.8 17.214 2.0 89.0 nc216859 37.036167 -121.879833 6.9 6.0 ... 1989-10-17 20:04:15.190 earthquake 1989.0 FFFF00 1.8 9 7 2 0 1
4 0.0 8.6 8.881 859.0 44.4 ci731691 34.061000 -118.079000 5.9 5.0 ... 1987-10-01 10:42:20.020 earthquake 1987.0 80E100 -1.1 8 9 -1 0 1

5 rows × 22 columns

def get_hour(timestr):
    # Extract the two-digit hour from a 'YYYY-MM-DD HH:MM:SS' string
    timesplit = timestr.split()[1]
    hourstr = timesplit.split(":")[0]
    return hourstr

dfmw['tod_hour'] = dfmw['time'].map(get_hour)


# Dummy the hour of quake occurrence - for example, are dyfi reports different if quakes happen in night hours
dfmw = pd.get_dummies(dfmw, columns=['tod_hour'], prefix='H', prefix_sep='_')
dfmw = dfmw.drop('H_23', 1)

# Dummy the seismographic network
dfmw = pd.get_dummies(dfmw, columns=['net'], prefix='N', prefix_sep='_')
dfmw = dfmw.drop('N_us', 1)

dfmw.head()
alert cdi depth felt gap id lat long mag magDecade ... H_22 N_ci N_ld N_mb N_nc N_nm N_nn N_se N_uu N_uw
0 0.0 8.9 8.950 1141.0 123.4 ci3347678 34.416000 -118.370000 6.6 6.0 ... 0 1 0 0 0 0 0 0 0 0
1 0.0 8.3 15.000 75.0 168.0 ci3352060 32.667333 -115.359167 6.4 6.0 ... 0 1 0 0 0 0 0 0 0 0
2 0.0 8.4 9.578 73.0 137.0 nc1091100 36.231667 -120.312000 6.7 6.0 ... 0 0 0 0 1 0 0 0 0 0
3 0.0 6.8 17.214 2.0 89.0 nc216859 37.036167 -121.879833 6.9 6.0 ... 0 0 0 0 1 0 0 0 0 0
4 0.0 8.6 8.881 859.0 44.4 ci731691 34.061000 -118.079000 5.9 5.0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 53 columns

# Add crustal thickness and density data from CRUST1.0
# Laske, G., Masters., G., Ma, Z. and Pasyanos, M., Update on CRUST1.0 - A 1-degree Global Model of Earth's Crust, 
# Geophys. Res. Abstracts, 15, Abstract EGU2013-2658, 2013.

rho = pd.read_csv('/Users/erhepp/fortran/crust1.0/crust1.rho', delim_whitespace=True, header=None, \
                 names=['top_water', 'bottom_water', 'bottom_ice', 'bottom_sed1', 'bottom_sed2', 'bottom_sed3', \
                       'bottom_crust1', 'bottom_crust2', 'bottom_crust3'])

bnds = pd.read_csv('/Users/erhepp/fortran/crust1.0/crust1.bnds', delim_whitespace=True, header=None, \
                 names=['top_water', 'bottom_water', 'bottom_ice', 'bottom_sed1', 'bottom_sed2', 'bottom_sed3', \
                       'bottom_crust1', 'bottom_crust2', 'bottom_crust3'])

print rho.shape, bnds.shape

def get_crust (qlat, qlong, crust_param):
# Return crustal density and thickness data for a given latitude / longitude
# When called with crust_param = 'rho', returns the grid index, lat, long and 9 density values:
#    0-2 are surface water and ice, 3-5 are sedimentary layers, and 6-8 are crystalline crust layers
# When called with crust_param = 'bnds', returns layer boundary depths instead; the last value is the
#    depth to the Mohorovičić discontinuity, the boundary between the Earth's crust and mantle.


    # Prevent any float representation uncertainty from pushing latitude and longitude out of range
    if qlat >= 89.999: qlat = 89.999
    if qlat <= -89.999: qlat = -89.999
    if qlong >= 179.999: qlong = 179.999    
    if qlong <= -179.999: qlong = -179.999 
    
    # Find the closest crust1.0 grid location to the requested latitude and longitude
    if qlat >= 0:
        rlat = round(qlat - 0.5) + 0.5
    else:
        rlat = round(qlat + 0.5) - 0.5
   
    if qlong >= 0:
        rlong = round(qlong - 0.5) + 0.5
    else:
        rlong = round(qlong + 0.5) - 0.5

    # Determine the index in the crust1.0 files that corresponds to the closest grid location  
    # Information from the crust1.0 readme file
    #   The model is defined from 89.5 to -89.5 deg latitude and -179.5 to 179.5 deg
    #   longitude. Longitudes are the inner loop, i.e. all longitudes are stored
    #   for each latitude, then the next latitude is given. The model starts at
    #   89.5 N and 179.5 W.
    
    dist_from_north = 89.5 - rlat
    dist_from_west = 179.5 + rlong
    idx = int((dist_from_north)*360 + dist_from_west)
    
    # Calculate the latitude and longitude of the crust1.0 datapoint, and return the index location,
    # latitude, longitude and, for each crustal layer, the requested data.
    # Information from the crust1.0 readme file on order and meaning of values
    #    1) top of water
    #    2) bottom of water
    #    3) bottom of ice
    #    4) bottom of sediments 1
    #    5) bottom of sediments 2
    #    6) bottom of sediments 3
    #    7) bottom of cryst. crust 1
    #    8) bottom of cryst. crust 2
    #    9) bottom of cryst. crust 3 = Moho (depth to Moho, not crustal thickness!)

    # For this project, we will only use the density values for sediment and crystalline crust layers
    
    flat = 90 - dist_from_north-0.5
    flon = -180. + dist_from_west+0.5
    if crust_param == 'rho':
        return [idx, flat, flon, round(rho.iloc[idx,0],2), round(rho.iloc[idx,1],2), \
                                             round(rho.iloc[idx,2],2), round(rho.iloc[idx,3],2), \
                                             round(rho.iloc[idx,4],2), round(rho.iloc[idx,5],2), \
                                             round(rho.iloc[idx,6],2), round(rho.iloc[idx,7],2), \
                                             round(rho.iloc[idx,8],2)]
    if crust_param == 'bnds':
        return [idx, flat, flon, round(bnds.iloc[idx,0],2), round(bnds.iloc[idx,1],2), \
                                             round(bnds.iloc[idx,2],2), round(bnds.iloc[idx,3],2), \
                                             round(bnds.iloc[idx,4],2), round(bnds.iloc[idx,5],2), \
                                             round(bnds.iloc[idx,6],2), round(bnds.iloc[idx,7],2), \
                                             round(bnds.iloc[idx,8],2)]


(64800, 9) (64800, 9)
dfmw.shape
(2009, 53)
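As a sanity check (illustrative only, not in the original notebook), a single lookup near Los Angeles would look like:

# Returns [file index, grid lat, grid long, then the 9 layer values]
print(get_crust(34.05, -118.25, 'rho'))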
# Look up the nearest CRUST1.0 grid point once per quake (one call per file),
# then copy the layer values into new columns
for idx in range(len(dfmw)):
    bnds_vals = get_crust(dfmw.loc[idx, 'lat'], dfmw.loc[idx, 'long'], 'bnds')
    rho_vals = get_crust(dfmw.loc[idx, 'lat'], dfmw.loc[idx, 'long'], 'rho')
    dfmw.loc[idx, 'crust_thickness'] = np.abs(bnds_vals[-1])
    dfmw.loc[idx, 'sed_density1'] = rho_vals[6]
    dfmw.loc[idx, 'sed_density2'] = rho_vals[7]
    dfmw.loc[idx, 'sed_density3'] = rho_vals[8]
    dfmw.loc[idx, 'cryst_crust1_density'] = rho_vals[9]
    dfmw.loc[idx, 'cryst_crust2_density'] = rho_vals[10]
    dfmw.loc[idx, 'cryst_crust3_density'] = rho_vals[11]

print dfmw.shape
dfmw.head()
    

(2009, 60)
alert cdi depth felt gap id lat long mag magDecade ... N_se N_uu N_uw crust_thickness sed_density1 sed_density2 sed_density3 cryst_crust1_density cryst_crust2_density cryst_crust3_density
0 0.0 8.9 8.950 1141.0 123.4 ci3347678 34.416000 -118.370000 6.6 6.0 ... 0 0 0 29.37 2.28 0.0 2.79 2.86 2.95 3.30
1 0.0 8.3 15.000 75.0 168.0 ci3352060 32.667333 -115.359167 6.4 6.0 ... 0 0 0 25.44 2.28 0.0 2.74 2.78 2.86 3.32
2 0.0 8.4 9.578 73.0 137.0 nc1091100 36.231667 -120.312000 6.7 6.0 ... 0 0 0 25.91 2.28 0.0 2.79 2.86 2.95 3.29
3 0.0 6.8 17.214 2.0 89.0 nc216859 37.036167 -121.879833 6.9 6.0 ... 0 0 0 25.30 2.28 0.0 2.79 2.86 2.95 3.30
4 0.0 8.6 8.881 859.0 44.4 ci731691 34.061000 -118.079000 5.9 5.0 ... 0 0 0 29.37 2.28 0.0 2.79 2.86 2.95 3.30

5 rows × 60 columns

factors = [x for x in dfmw.columns if x not in ['alert', 'id', 'lat', 'long', 'magDecade', 'time', 'year', 'color', \
                                                'mmi', 'mmi-cdi_difference', 'mmi_round', 'cdi_round', 'diff_rounded', \
                                                'type']]
print factors

ya = dfmw['mmi_round'].values
Xa = dfmw[factors]
# Modeled in prior model
# X = dfmw[['cdi', 'depth', 'felt', 'Z_East', 'Z_West']]
['cdi', 'depth', 'felt', 'gap', 'mag', 'Z_East', 'Z_West', 'H_00', 'H_01', 'H_02', 'H_03', 'H_04', 'H_05', 'H_06', 'H_07', 'H_08', 'H_09', 'H_10', 'H_11', 'H_12', 'H_13', 'H_14', 'H_15', 'H_16', 'H_17', 'H_18', 'H_19', 'H_20', 'H_21', 'H_22', 'N_ci', 'N_ld', 'N_mb', 'N_nc', 'N_nm', 'N_nn', 'N_se', 'N_uu', 'N_uw', 'crust_thickness', 'sed_density1', 'sed_density2', 'sed_density3', 'cryst_crust1_density', 'cryst_crust2_density', 'cryst_crust3_density']
Xa.head()
cdi depth felt gap mag Z_East Z_West H_00 H_01 H_02 ... N_se N_uu N_uw crust_thickness sed_density1 sed_density2 sed_density3 cryst_crust1_density cryst_crust2_density cryst_crust3_density
0 8.9 8.950 1141.0 123.4 6.6 0 1 0 0 0 ... 0 0 0 29.37 2.28 0.0 2.79 2.86 2.95 3.30
1 8.3 15.000 75.0 168.0 6.4 0 1 0 0 0 ... 0 0 0 25.44 2.28 0.0 2.74 2.78 2.86 3.32
2 8.4 9.578 73.0 137.0 6.7 0 1 0 0 0 ... 0 0 0 25.91 2.28 0.0 2.79 2.86 2.95 3.29
3 6.8 17.214 2.0 89.0 6.9 0 1 0 0 0 ... 0 0 0 25.30 2.28 0.0 2.79 2.86 2.95 3.30
4 8.6 8.881 859.0 44.4 5.9 0 1 0 0 0 ... 0 0 0 29.37 2.28 0.0 2.79 2.86 2.95 3.30

5 rows × 46 columns

# Train/Test Split 
Xa_train, Xa_test, ya_train, ya_test = train_test_split(Xa, ya, test_size=0.30, random_state=42)
# Standardize the data

ss = StandardScaler()
Xas_train = ss.fit_transform(Xa_train)
Xas_test = ss.transform(Xa_test)
# The first example uses SciKit Learn

# Logistic Regression instantiation.  Note that penalty and solver are set to the defaults.
# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for details

lr_all = LogisticRegression(penalty="l2", C=1, fit_intercept=True, solver="liblinear")

# fit the model
lr_all.fit(Xas_train, ya_train)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# Review the results

# The score, intercept and coefficients 
print "Model Score (Train):", lr_all.score(Xas_train, ya_train)
print "\nModel Score (Test):",  lr_all.score(Xas_test, ya_test)
print "\nPrediction Classes:", lr_all.classes_

print "\nIntercept:", lr_all.intercept_
print "\nCoefficients:", lr_all.coef_


Model Score (Train): 0.663584637269

Model Score (Test): 0.620232172471

Prediction Classes: [ 0  1  2  3  4  5  6  7  8  9 10]

Intercept: [-5.75596323 -5.74830708 -5.53650979 -3.39372938 -1.38765705 -2.17328594
 -3.01130824 -4.93942256 -6.78984715 -6.47045241 -5.84615273]

Coefficients: [[ -2.58341801e-01   1.91478856e-01  -2.29945447e-01  -2.45489423e-01
    4.37757223e-01   2.21934589e-02   1.54707869e-01  -1.87088332e-01
   -5.35097681e-01  -2.75771698e-01  -3.60223784e-01  -7.81297490e-02
    1.42051173e-01  -2.39415523e-01  -4.03669183e-01  -2.96791777e-01
   -9.01655126e-01  -8.42585622e-02  -2.04741127e-01  -6.74128957e-01
   -3.59423019e-01  -1.49681086e-01  -8.57339498e-02  -5.85560246e-02
   -2.60128290e-01  -9.39246750e-02  -1.84595102e-01  -2.06227241e-01
   -7.65406963e-02  -6.22977405e-01  -2.53368720e-01   2.03628579e-01
   -1.81248265e-02  -7.66957985e-01  -3.38863343e-02  -1.78920149e-01
    4.17059890e-02  -3.27152934e-02  -1.87850642e-01  -9.83190861e-02
    1.29068907e-01   7.97768683e-02  -1.74952273e+00   7.67873522e-03
   -1.80658421e-01  -1.07923055e-01]
 [ -2.27931823e-01   2.50405059e-01   2.30624326e-01   3.00027384e-01
   -5.80384851e-01  -1.18094073e-02  -1.52779391e-01   3.51385986e-01
   -5.74165902e-02   1.00005828e-02   1.01988273e-02   9.51453048e-03
    9.08035758e-02   2.88776623e-02  -2.96653218e-02   9.15688439e-03
    5.99308537e-01   5.99473425e-02   2.99340833e-02   6.43784476e-02
    1.58238677e-01   1.21347829e-01   5.58852122e-02   8.22796777e-02
    1.04283611e-02   7.88591079e-02   1.01904899e-01   5.15987842e-02
   -1.31230608e-01   3.22468115e-01   1.15867391e-01  -1.61737705e-01
    1.86934956e-02   4.40258696e-01  -8.75001243e-02  -1.95510812e-02
   -6.90984534e-02   1.74838186e-02  -9.72336672e-02  -1.49466233e-01
   -3.11548369e-01  -3.38265685e-03  -3.28924400e-01   1.72894083e-01
   -1.63670208e-02   1.72378856e-01]
 [ -2.46053690e-01   3.95346544e-01   2.27774072e-01   5.34690200e-01
   -1.89109797e+00   1.62889078e-02   4.57156870e-01  -1.68312012e-01
   -1.13040778e-01  -1.34088191e-01  -1.61072122e-01  -1.56347280e-01
    3.09935564e-01  -9.88337174e-02   2.86014496e-02   5.43300184e-02
   -2.04578635e-01  -9.71918401e-02  -2.11469968e-01   1.38815351e-02
   -1.04819754e-01  -1.19110666e-01  -2.41623889e-01   2.13106550e-01
   -1.26511898e-01   1.50728799e-01  -2.45316741e-01  -1.05864811e-01
    2.48375886e-01   4.30389194e-02   1.58733790e-01  -8.75275522e-02
   -4.27519753e-02   1.34086128e+00   9.39169484e-04  -2.34276727e-01
    3.31546777e-02   4.25283321e-01   6.86827324e-01   4.46524704e-01
   -2.98006880e-01  -8.55627241e-02  -1.09914217e+00   4.93740066e-01
   -4.31862449e-01   2.23488358e-01]
 [ -5.05629434e-01   4.31357422e-01  -9.57724035e-01   1.05109242e-01
   -1.37911237e+00  -4.52284125e-01   1.69901570e-01  -7.42240758e-02
    6.34844274e-02   1.57581885e-01  -8.99495697e-02   1.65703263e-01
    4.31798761e-02  -1.75757648e-02   1.49383799e-01  -3.94018836e-02
    6.65344969e-02   5.37124796e-02   1.25571337e-01   8.67701382e-02
    2.62135392e-01   5.18302611e-03   3.37869280e-02   1.28488728e-02
    1.76668623e-01   1.15352431e-01  -1.98996464e-01  -9.06119522e-02
   -8.44779302e-02  -1.34417199e-01  -7.59574275e-01  -1.90621281e-01
   -5.11957265e-02  -9.88442708e-02  -3.81274234e-02   4.35751861e-01
   -9.12911616e-02  -1.74702186e-04  -1.91155739e-02   3.23611698e-01
   -2.68965819e-01  -1.87180433e-01  -2.75196319e-01   3.08020579e-01
    3.87734365e-01   1.99707988e-01]
 [ -2.19176905e-01  -1.74572985e-01  -9.05916462e-01  -3.84719232e-02
   -5.75831121e-01  -2.01891783e-02  -4.02029023e-01   1.42115535e-01
    1.16053236e-01   8.03135612e-02   1.30989900e-01  -4.86081508e-02
    2.29566079e-02   1.50151538e-01  -3.68201614e-02  -8.69586420e-03
    9.93499493e-02   1.84818065e-01   2.05218243e-01   1.00073747e-01
   -4.71303100e-02   3.28831592e-01   1.87289508e-01  -4.89639715e-02
    2.07197659e-02   4.17666651e-02   1.86758548e-01   3.00023702e-02
    3.84960184e-02   2.34049126e-01  -6.55991199e-01  -7.39933200e-02
    1.22319832e-01  -5.10101210e-01  -5.50097081e-02  -3.32827248e-02
    7.06837658e-02  -7.14955439e-02  -6.64272221e-01  -1.06179167e-01
    8.31668467e-02  -6.53415330e-02   6.00286332e-01   1.63049133e-01
    3.53667131e-01  -4.22594347e-01]
 [ -1.45822881e-01  -6.74636507e-01  -5.94229346e-01  -6.55182499e-02
    2.87385991e-01   2.44269730e-01  -1.90805380e-01   1.02363578e-02
   ...  (remaining rows of the per-class coefficient matrix omitted for brevity) ]]
# Get predicted values to plot against actuals
all_predicted = lr_all.predict(Xas_test)

logreg_test = pd.DataFrame({'ya_test' : ya_test, 'predicted' : all_predicted})
logreg_test = logreg_test[['ya_test', 'predicted']]
sns.jointplot(x='ya_test', y='predicted', data=logreg_test, kind="reg")
[figure: regression jointplot of actual (ya_test) vs. predicted MMI class]

logreg_test.head()
ya_test predicted
0 4 4
1 4 4
2 6 4
3 5 6
4 4 4

Set up a GridSearch and try again, to determine whether the model can be further improved.

from sklearn.model_selection import GridSearchCV

# Gridsearch for best C and penalty
gs_params = {
    'penalty':['l1', 'l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,5,100)
}

# Define the GridSearch with CrossValidation and fit the model
lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=3, verbose=1, n_jobs=-1)
lr_gridsearch.fit(Xas_train, ya_train)
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.8min finished





GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.26186e-05, ...,   7.92483e+04,   1.00000e+05]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
# best score on the training data:
lr_gridsearch.best_score_
0.67638691322901845
# best parameters on the training data:
lr_gridsearch.best_params_
{'C': 14.508287784959402, 'penalty': 'l1', 'solver': 'liblinear'}
# assign the best estimator to a variable:
best_lr = lr_gridsearch.best_estimator_
# Score it on the testing data:
best_lr.score(Xas_test, ya_test)
0.67495854063018246
# Get predicted values to plot against actuals
gsall_predicted = best_lr.predict(Xas_test)

gs_logreg_test = pd.DataFrame({'ya_test' : ya_test, 'predicted' : gsall_predicted})
gs_logreg_test = gs_logreg_test[['ya_test', 'predicted']]
sns.jointplot(x='ya_test', y='predicted', data=gs_logreg_test, kind="reg")
[figure: regression jointplot of actual (ya_test) vs. predicted MMI class for the grid-searched model]
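Since the jointplot treats the ordinal intensity classes as continuous, a confusion matrix is a cleaner way to see where the classifier mixes up neighboring classes. A minimal sketch using pandas' crosstab and the ya_test / gsall_predicted variables from above:

# Cross-tabulate actual vs. predicted MMI classes for the grid-searched model
conf_mat = pd.crosstab(pd.Series(ya_test, name='actual'),
                       pd.Series(gsall_predicted, name='predicted'))
print conf_mat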

# Sanity check: feature names and coefficient arrays line up (coef_[0] holds the first class in this one-vs-rest model)
print len(Xa.columns), len(best_lr.coef_[0]), len(np.round(np.exp(best_lr.coef_[0])*100-100,2))
46 46 46
coef_df = pd.DataFrame({
        'features': Xa.columns,
        'log odds': best_lr.coef_[0],
        'percentage change in odds': np.round(np.exp(best_lr.coef_[0])*100-100,2)
    })
coef_df.sort_values(by='percentage change in odds', ascending=0)
features log odds percentage change in odds
4 mag 3.320124 2666.38
1 depth 1.441819 322.84
3 gap 0.639691 89.59
0 cdi 0.000000 0.00
28 H_05 0.000000 0.00
20 cryst_crust1_density 0.000000 0.00
21 cryst_crust2_density 0.000000 0.00
22 cryst_crust3_density 0.000000 0.00
27 H_04 0.000000 0.00
37 H_14 0.000000 0.00
31 H_08 0.000000 0.00
17 sed_density1 0.000000 0.00
38 H_15 0.000000 0.00
39 H_16 0.000000 0.00
41 H_18 0.000000 0.00
43 H_20 0.000000 0.00
18 sed_density2 0.000000 0.00
23 H_00 0.000000 0.00
16 crust_thickness 0.000000 0.00
11 N_nm 0.000000 0.00
6 Z_West 0.000000 0.00
7 N_ci 0.000000 0.00
8 N_ld 0.000000 0.00
9 N_mb 0.000000 0.00
5 Z_East 0.000000 0.00
13 N_se 0.000000 0.00
12 N_nn 0.000000 0.00
14 N_uu 0.000000 0.00
33 H_10 -0.060049 -5.83
42 H_19 -0.141052 -13.16
40 H_17 -0.351786 -29.66
15 N_uw -0.511928 -40.07
26 H_03 -0.544275 -41.97
34 H_11 -0.609148 -45.62
29 H_06 -0.645657 -47.57
25 H_02 -0.700943 -50.39
2 felt -0.787103 -54.48
44 H_21 -0.976178 -62.33
36 H_13 -1.304471 -72.87
30 H_07 -1.366672 -74.50
24 H_01 -1.989545 -86.32
35 H_12 -3.034955 -95.19
32 H_09 -3.252049 -96.13
45 H_22 -3.599742 -97.27
10 N_nc -4.933407 -99.28
19 sed_density3 -7.470995 -99.94
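Note that the grid search selected the L1 (Lasso) penalty, which shrinks uninformative coefficients to exactly zero; that is why so many rows above read 0.00. The percentage column is just the exponentiated coefficient, rescaled. As a quick sanity check on the largest entry (mag; the inputs were standardized, so this is the change in odds per one standard deviation of magnitude):

# exp(3.320124) ~ 27.66, i.e. roughly a 2666% increase in the odds
print np.round(np.exp(3.320124) * 100 - 100, 2)   # -> 2666.38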
# Create a subset of "coef_df" DataFrame with most important coefficients
imp_coefs = pd.concat([coef_df.sort_values(by='percentage change in odds', ascending=0).head(4),
                     coef_df.sort_values(by='percentage change in odds', ascending=0).tail(18)])
imp_coefs.set_index('features', inplace=True)
imp_coefs
log odds percentage change in odds
features
mag 3.320124 2666.38
depth 1.441819 322.84
gap 0.639691 89.59
cdi 0.000000 0.00
H_10 -0.060049 -5.83
H_19 -0.141052 -13.16
H_17 -0.351786 -29.66
N_uw -0.511928 -40.07
H_03 -0.544275 -41.97
H_11 -0.609148 -45.62
H_06 -0.645657 -47.57
H_02 -0.700943 -50.39
felt -0.787103 -54.48
H_21 -0.976178 -62.33
H_13 -1.304471 -72.87
H_07 -1.366672 -74.50
H_01 -1.989545 -86.32
H_12 -3.034955 -95.19
H_09 -3.252049 -96.13
H_22 -3.599742 -97.27
N_nc -4.933407 -99.28
sed_density3 -7.470995 -99.94
# Plot important coefficients
sns.set(font_scale=1.3)
imp_coefs['percentage change in odds'].plot(kind = "barh")
plt.title("Percentage change in odds with Lasso (L1) regularization")
plt.show()

[figure: horizontal bar chart of the percentage change in odds for the most important coefficients]

Comments

There is clearly some more investigation required. The hour of day does seem to affect the model, but close examination reveals no coherent pattern, so the hour dummies are likely overfitted, despite the similar accuracy scores on the training and test datasets.
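To test the hour-of-day suspicion directly, one option is to drop the H_* dummies and compare cross-validated accuracy; if the scores barely move, the hour features add little real signal. A minimal sketch, assuming Xa is row-aligned with dfmw['mmi_round'] (both names appear above) and reusing the grid-searched hyperparameters:

from sklearn.model_selection import cross_val_score

# Compare cross-validated accuracy with and without the hour-of-day dummies
hour_cols = [c for c in Xa.columns if c.startswith('H_')]
for label, X_ in [('with hours', Xa), ('without hours', Xa.drop(hour_cols, axis=1))]:
    scores = cross_val_score(LogisticRegression(C=14.5, penalty='l1', solver='liblinear'),
                             StandardScaler().fit_transform(X_),
                             dfmw['mmi_round'].values, cv=3)
    print "{:13}: {:0.3f} +/- {:0.3f}".format(label, scores.mean(), scores.std())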

The CDI, or crowdsourced intensity, is no longer an important factor in predicting MMI: the model shows magnitude and depth as the most significant predictors. This is not surprising, as we showed above that maximum CDI intensity tends to under-report relative to MMI.

Stay tuned to this site for future updates, which I hope to post soon.

Run the group of classifiers again to find the best performing model

I ran the loop of different classifiers on the important coefficients determined from logistic regression with regularization. This is questionable, as we shouldn't assume that factors unimportant in one model will be unimportant in other models, but in this case using them can produce a better model than logistic regression. I used a method described by Jason Brownlee, https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/, to compare classifiers.

In this case, the tree ensembles performed best: on this run, Gradient Boosting produced the highest score on the test data as well as on the cross-validation of the training data, with Random Forest close behind.

# Get the list of the most important features from the regularized model
imp_features = imp_coefs.index
# Set up X and y
# NOTE: as the 46-feature counts in the output below show, this run actually uses
# the full factor set; substitute dfmw[imp_features] to restrict to the subset above
Xi = dfmw[factors]
yi = dfmw['mmi_round'].values

# Test Train Split

Xi_train, Xi_test, yi_train, yi_test = train_test_split(Xi, yi, test_size=0.30)
# Standardize 
ss = StandardScaler()

Xis_train = ss.fit_transform(Xi_train)
Xis_test = ss.transform(Xi_test)

# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))  # fails below, likely due to collinear features within classes
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))  # fails below: MultinomialNB requires non-negative inputs, but the data has been standardized
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, Xis_train, yi_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(Xis_train,yi_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, Xis_train.shape[1], metrics.accuracy_score(yi_train, this_model.predict(Xis_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(yi_test, this_model.predict(Xis_test)))
    except Exception:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure(figsize=(12,8))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()
Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        46         0.700               0.622 +/- 0.012         0.629 
LDA       46         0.701               0.626 +/- 0.024         0.657 
    QDA  :   failed on this input dataset 
KNN       46         0.807               0.633 +/- 0.041         0.708 
CART      46         1.000               0.805 +/- 0.004         0.806 
RFST      46         0.993               0.817 +/- 0.017         0.841 
GB        46         0.974               0.819 +/- 0.021         0.854 
ADA       46         0.415               0.416 +/- 0.041         0.413 
SVM       46         0.829               0.707 +/- 0.025         0.733 
GNB       46         0.230               0.225 +/- 0.013         0.222 
    MNB  :   failed on this input dataset 
BNB       46         0.556               0.509 +/- 0.022         0.529 

[figure: boxplot comparison of cross-validation accuracy across the classifiers, with the baseline accuracy as a dashed line]
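Given that Gradient Boosting edged out the other classifiers on this run, a natural follow-up is to inspect which features it relies on and compare them with the regularized logistic regression coefficients above. A minimal sketch, reusing Xis_train, yi_train, and Xi from above:

# Fit the best-performing classifier and list its ten most important features
gb = GradientBoostingClassifier(random_state=seed)
gb.fit(Xis_train, yi_train)
gb_imp = pd.Series(gb.feature_importances_, index=Xi.columns)
print gb_imp.sort_values(ascending=False).head(10)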


Written on October 3, 2017