model2 = smf.ols(formula="np.log(1+sales) ~ np.log(1+TV) + np.log(1+radio) + np.log(1+newspaper)", data=ads)
print(model2.fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(1 + sales) R-squared: 0.938
Model: OLS Adj. R-squared: 0.937
Method: Least Squares F-statistic: 983.6
Date: Thu, 14 Sep 2023 Prob (F-statistic): 7.71e-118
Time: 19:12:50 Log-Likelihood: 190.46
No. Observations: 200 AIC: -372.9
Df Residuals: 196 BIC: -359.7
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 0.5205 0.043 11.968 0.000 0.435 0.606
np.log(1 + TV) 0.3280 0.007 47.161 0.000 0.314 0.342
np.log(1 + radio) 0.1876 0.008 24.849 0.000 0.173 0.202
np.log(1 + newspaper) 0.0135 0.008 1.715 0.088 -0.002 0.029
==============================================================================
Omnibus: 27.667 Durbin-Watson: 1.938
Prob(Omnibus): 0.000 Jarque-Bera (JB): 110.636
Skew: 0.395 Prob(JB): 9.45e-25
Kurtosis: 6.557 Cond. No. 42.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Step 4. Interpret the Results
The first panel gives general information on the model fit, including $R^2$, $R_a^2$, and the F-test of the overall model:
$$H_0:\beta_{TV}=\beta_{radio}=\beta_{newspaper}=0\quad \text{vs.}\quad H_a:\text{at least one of them is not zero}$$
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Prob (F-statistic): 1.58e-96
Time: Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
The second panel lists covariate-wise fitting results, including each estimate, its standard error, and the t-test (with its p-value) of
$$H_0: \beta_i=0\quad \text{vs.}\quad H_a:\beta_i\neq 0$$
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
The third panel reports diagnostic tests on the residuals, e.g., whether they are normally distributed (Omnibus and Jarque-Bera, together with skew and kurtosis) and whether they are serially correlated (Durbin-Watson).
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
The footnotes give additional clarifications.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Entries from the summary table can be directly extracted from the fit result.
Some useful attributes:
params
: $\hat\beta$
bse
: $\mathrm{se}(\hat\beta)$
pvalues
: p-values for the coefficients
mse_resid
: $s_e^2$
fittedvalues
: $\hat y$
resid
: $\hat\epsilon$
For a complete list, please check: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.html
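For example, a minimal sketch of extracting these quantities, assuming the fit of the original model is stored in a variable named fit (an illustrative name):
fit = smf.ols(formula="sales ~ TV + radio + newspaper", data=ads).fit()
print(fit.params)                      # beta_hat
print(fit.bse)                         # se(beta_hat)
print(fit.pvalues)                     # p-values of the coefficient t-tests
print(fit.mse_resid)                   # s_e^2
print(fit.rsquared, fit.rsquared_adj)  # R^2 and adjusted R^2
print(fit.conf_int())                  # the [0.025, 0.975] columns of the table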
Nested models can also be compared with a partial F-test, which yields an ANOVA table like the one below (model 0 is a reduced one-predictor model, model 1 the full three-predictor model):
   df_resid          ssr  df_diff      ss_diff           F        Pr(>F)
0     198.0  5134.804544      0.0          NaN         NaN           NaN
1     196.0   556.825263      2.0  4577.979281  805.714107  2.812622e-95
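A sketch of how such a table can be produced; the choice of sales ~ newspaper as the reduced model is an assumption, inferred from the 198 residual degrees of freedom above:
import statsmodels.api as sm
fit_reduced = smf.ols(formula="sales ~ newspaper", data=ads).fit()            # df_resid = 198
fit_full = smf.ols(formula="sales ~ TV + radio + newspaper", data=ads).fit()  # df_resid = 196
print(sm.stats.anova_lm(fit_reduced, fit_full))  # F-test of the two extra predictors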
Use Your Own Code
Imagine you lose access to the statsmodels library. Then you need to code the formulas covered in lectures on your own.
1. Create data
y = ads['sales']
X = ads.loc[:, ~ads.columns.isin(['sales'])] # all columns except for sales
X.insert(0, 'intercept', 1) # insert intercept 1 to the left
print(X)
intercept TV radio newspaper
1 1 230.1 37.8 69.2
2 1 44.5 39.3 45.1
3 1 17.2 45.9 69.3
4 1 151.5 41.3 58.5
5 1 180.8 10.8 58.4
.. ... ... ... ...
196 1 38.2 3.7 13.8
197 1 94.2 4.9 8.1
198 1 177.0 9.3 6.4
199 1 283.6 42.0 66.2
200 1 232.1 8.6 8.7
[200 rows x 4 columns]
2. Calculate $\mathbf{\hat\beta}$
Method 1: direct implementation.
$$\mathbf{\hat\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
numpy.linalg.inv gives the matrix inverse.
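A minimal sketch of Method 1, converting the pandas objects created above into NumPy arrays first (Xm and ym are illustrative names):
import numpy as np
Xm = X.to_numpy(dtype=float)                      # design matrix, 200 x 4
ym = y.to_numpy(dtype=float)                      # response vector
beta_hat = np.linalg.inv(Xm.T @ Xm) @ Xm.T @ ym   # (X'X)^{-1} X'y
print(beta_hat)                                   # should match fit.params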
Method 2: normal equation.
$$\mathbf{X}'\mathbf{X}\mathbf{\hat\beta}=\mathbf{X}'\mathbf{y}$$
numpy.linalg.solve solves the linear system directly, which avoids forming the matrix inverse explicitly and is numerically more stable.
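A corresponding sketch of Method 2, reusing Xm and ym from above:
beta_hat = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)   # solve (X'X) beta = X'y
print(beta_hat)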
4. Calculate the standard errors for $\hat\beta$
Recall that
$$\mathrm{Cov}[\mathbf{\hat\beta}] = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$$
$$\mathrm{se}^2(\hat\beta_i)= s_e^2\left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{ii}$$
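A sketch of the computation, reusing Xm, ym, and beta_hat from above; since $s_e^2 = \mathrm{SSE}/(n-p-1)$ requires the residuals, they are computed here as well:
y_hat = Xm @ beta_hat                            # fitted values
resid = ym - y_hat                               # residuals
s2_e = np.sum(resid ** 2) / (200 - 4)            # s_e^2 = SSE / (n - p - 1)
XtX_inv = np.linalg.inv(Xm.T @ Xm)
se_beta_hat = np.sqrt(s2_e * np.diag(XtX_inv))   # se(beta_hat_i)
print(se_beta_hat)                               # should match fit.bse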
The t-statistic for testing $H_0:\beta_i=0$ is $t_i=\hat\beta_i/\mathrm{se}(\hat\beta_i)$, and the two-sided p-value uses the $t$ distribution with $n-p-1=196$ degrees of freedom:
from scipy.stats import t
t_stat = np.abs(beta_hat / se_beta_hat)   # |t| statistics
pvalues = 2 * t.cdf(-t_stat, df=200-4)    # two-sided p-values
print(pvalues)
Finally, compute $R^2=\mathrm{SSR}/\mathrm{SST}$ and $R_a^2 = 1-\frac{\mathrm{SSE}/(n-p-1)}{\mathrm{SST}/(n-1)}$:
SST = np.sum((y - y.mean()) ** 2)            # total sum of squares
SSR = np.sum((y_hat - y_hat.mean()) ** 2)    # regression sum of squares
SSE = SST - SSR                              # error sum of squares
R2 = SSR / SST
R2a = 1 - (SSE / (200 - 4)) / (SST / (200 - 1))
print(R2, R2a)
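As an extra check (a sketch, not one of the original steps), the overall F-statistic reported in the first summary panel can be reproduced from the same sums of squares:
F = (SSR / 3) / (SSE / (200 - 4))   # F = MSR / MSE with (3, 196) degrees of freedom
print(F)                            # approximately 570.3, matching the summary table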