
Prediction (out of sample)


In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     956.6
Date:                Thu, 21 May 2015   Prob (F-statistic):           1.96e-41
Time:                        05:56:14   Log-Likelihood:                 1.2217
No. Observations:                  50   AIC:                             5.557
Df Residuals:                      46   BIC:                             13.20
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.9654      0.084     59.175      0.000         4.796     5.134
x1             0.5088      0.013     39.314      0.000         0.483     0.535
x2             0.5651      0.051     11.109      0.000         0.463     0.668
x3            -0.0206      0.001    -18.144      0.000        -0.023    -0.018
==============================================================================
Omnibus:                        0.840   Durbin-Watson:                   2.269
Prob(Omnibus):                  0.657   Jarque-Bera (JB):                0.577
Skew:                          -0.263   Prob(JB):                        0.749
Kurtosis:                       2.972   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.44997408   4.96265684   5.43161584   5.82605136   6.12627912
   6.32696437   6.4379984    6.4828734    6.49482276   6.51136093
   6.56811991   6.69299498   6.90156162   7.1945165    7.55756297
   7.96376004   8.37794864   8.76252818   9.08363424   9.31670235
   9.45050393   9.48899104   9.45064714   9.36545023   9.26994764
   9.20125134   9.19094055   9.25987342   9.41476003   9.64705998
   9.93438555  10.24417991  10.53906618  10.78298825  10.9471348
  11.01467283  10.98351335  10.86665452  10.69004615  10.4883262
  10.29912985  10.15690615  10.08725815  10.10273638  10.20077685
  10.36412228  10.56365745  10.76319271  10.92540989  11.01799353]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 11.00539682  10.84456472  10.55765996  10.1951886    9.82363439
   9.50918122   9.30150901   9.2216303    9.25674569   9.36337756]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic from using **2.
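To see what I() buys us, we can inspect the design matrix the formula builds: the column generated from I((x1 - 5) ** 2) is exactly the plain arithmetic result. A small sketch, re-creating the data with a fixed seed so it runs standalone (the seed is an addition, not part of the original notebook):

```python
import numpy as np
from statsmodels.formula.api import ols

# Re-create the artificial data (seeded for reproducibility)
np.random.seed(0)
x1 = np.linspace(0, 20, 50)
y = (5 + 0.5 * x1 + 0.5 * np.sin(x1) - 0.02 * (x1 - 5)**2
     + 0.25 * np.random.normal(size=50))

res = ols("y ~ x1 + np.sin(x1) + I((x1 - 5) ** 2)",
          data={"x1": x1, "y": y}).fit()

# Design matrix columns: Intercept, x1, np.sin(x1), I((x1 - 5) ** 2)
design = res.model.exog
print(np.allclose(design[:, 3], (x1 - 5)**2))
```

Without the I() wrapper, the formula parser would try to interpret the minus sign and ** as formula operators (term removal and interaction expansion) rather than as Python arithmetic.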

In [8]:
res.params
Out[8]:
Intercept           4.965353
x1                  0.508755
np.sin(x1)          0.565142
I((x1 - 5) ** 2)   -0.020615
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.00539682,  10.84456472,  10.55765996,  10.1951886 ,
         9.82363439,   9.50918122,   9.30150901,   9.2216303 ,
         9.25674569,   9.36337756])
