Logo

Robust Linear ModelsΒΆ

Link to Notebook GitHub

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

Estimation

Load data:

In [2]:
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)

Huber's T norm with the (default) median absolute deviation scaling

In [3]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
            xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[ 9.79189854  0.11100521  0.30293016  0.12864961]
                    Robust linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT
Scale Est.:                       mad
Cov Type:                          H1
Date:                Thu, 21 May 2015
Time:                        05:57:49
No. Iterations:                    19
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000       -60.218   -21.835
var_1          0.8294      0.111      7.472      0.000         0.612     1.047
var_2          0.9261      0.303      3.057      0.002         0.332     1.520
var_3         -0.1278      0.129     -0.994      0.320        -0.380     0.124
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

Huber's T norm with 'H2' covariance matrix

In [4]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[ 9.08950419  0.11945975  0.32235497  0.11796313]

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

In [5]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM

Artificial data with outliers:

In [6]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3   # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth

Note that the quadratic term in OLS regression will capture outlier effects.

In [7]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.02448278  0.52714658 -0.01436688]
[ 0.47195451  0.07286341  0.00644729]
[  4.66531069   4.93671942   5.20334118   5.46517597   5.7222238
   5.97448466   6.22195856   6.46464549   6.70254545   6.93565844
   7.16398447   7.38752353   7.60627563   7.82024076   8.02941892
   8.23381012   8.43341434   8.62823161   8.8182619    9.00350523
   9.18396159   9.35963099   9.53051342   9.69660888   9.85791738
  10.0144389   10.16617347  10.31312106  10.45528169  10.59265535
  10.72524205  10.85304178  10.97605454  11.09428034  11.20771917
  11.31637103  11.42023593  11.51931386  11.61360482  11.70310881
  11.78782584  11.86775591  11.942899    12.01325513  12.0788243
  12.13960649  12.19560172  12.24680998  12.29323128  12.33486561]

Estimate RLM:

In [8]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[  4.92561014e+00   5.16693155e-01  -4.23172786e-03]
[ 0.15687747  0.02421976  0.00214308]

Draw a plot to compare OLS estimates to the robust estimates:

In [9]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
Out[9]:
<matplotlib.legend.Legend at 0x7fb59733a390>

Example 2: linear function with linear truth

Fit a new OLS model using only the linear term and the constant:

In [10]:
X2 = X[:,[0,1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[ 5.60355615  0.38347775]
[ 0.40991973  0.03532034]

Estimate RLM:

In [11]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[ 5.07805872  0.47912024]
[ 0.12256837  0.01056099]

Draw a plot to compare OLS estimates to the robust estimates:

In [12]:
prstd, iv_l, iv_u = wls_prediction_std(res2)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")

Previous topic

Robust Linear Models

Next topic

M-Estimators for Robust Linear Modeling

This Page