Solution file for additional exercise 2.6
-----------------------------------------
The purpose of this solution file is to demonstrate and comment on
Minitab commands that correspond to the Stata solutions presented in VHM
812 for linear regression exercises 1-3. Output is only included for
demonstration of differences between Stata and Minitab.


Exercise 1:
-----------
* open the btb_episodes data set
use btb_episodes.dta, clear

MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’
Worksheet was saved on 07/01/2013

* Q1a: distribution of outcome -intvl-
codebook intvl
histogram intvl

MTB > GSummary 'intvl'.
Summary for intvl 

* Q1b: natural log transformed outcome
generate intvl_ln=ln(intvl)
histogram intvl_ln

MTB > Name C7 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')
MTB > GSummary  'intvl_ln'.
Summary for intvl_ln 


* Q2: simple linear regressions
regress intvl_ln p_rct
regress intvl_ln p_year
regress intvl_ln hdsize

MTB > Fitline 'intv_ln' 'p_rct';
SUBC>   Confidence 95.0.
Regression Analysis: intv_ln versus p_rct 
MTB > Fitline 'intv_ln' 'p_year';
SUBC>   Confidence 95.0.
Regression Analysis: intv_ln versus p_year 
MTB > Fitline 'intv_ln' 'hdsize';
SUBC>   Confidence 95.0.
Regression Analysis: intv_ln versus hdsize


* Q3: multiple linear regression
regress intvl_ln p_rct p_year hdsize

MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'p_year' 'hdsize';
SUBC>   Terms C5 C6 hdsize;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_rct, p_year, hdsize 

* added calculation of correlation between p_rct and hdsize
pwcorr p_rct hdsize

MTB > Correlation 'p_rct' 'hdsize'.
Correlations: p_rct, hdsize 

regress intvl_ln hdsize
estimates store smpl
regress intvl_ln p_rct p_year hdsize
estimates store mltpl
estimates table smpl mltpl

Note: It is possible in Minitab to store estimates in a column in the
worksheet, but any further processing needs to be done manually.


* Q4: predicion intervals for means and individuals (based only on hdsize)
regress intvl_ln hdsize
* doing the calculations yourself
predict pv, xb
predict pv_mean_se, stdp
scalar tstar=invttail(2985,.025) /* using DFE */
generate pv_mean_u=pv + tstar*pv_mean_se
generate pv_mean_l=pv - tstar*pv_mean_se
predict pv_ind_se, stdf
generate pv_ind_u=pv + tstar*pv_ind_se
generate pv_ind_l=pv - tstar*pv_ind_se
twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize) ///
  (line pv_mean_u hdsize) (line pv_mean_l hdsize) ///
  (line pv_ind_u hdsize) (line pv_ind_l hdsize)
* using the built in graphics capabilities (only simple linear regression)
twoway (scatter intvl_ln hdsize, sort msize(vsmall)) ///
  (lfitci intvl_ln hdsize, ciplot(rline)) ///
  (lfitci intvl_ln hdsize, stdf ciplot(rline))

MTB > Fitline 'intvl_ln' 'hdsize';
SUBC>   Confidence 95.0;
SUBC>   Ci;
SUBC>   Pi.
Regression Analysis: intvl_ln versus hdsize 

MTB > Subset;
SUBC>   Exclude;
SUBC>   MExclude;
SUBC>   Where "hdsize EQ miss()";
SUBC>   Name "hdsize equals (Missing) excluded".
Subset worksheet hdsize equals (Missing) excluded created.

Results for: hdsize equals (Missing) excluded
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'hdsize';
SUBC>   Terms hdsize;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus hdsize 
MTB > Name C8 "PFITS" C9 "PSEFITS" C10 "CLIM" C11 "CLIM_1" C12 "PLIM" C13 "PLIM_1".
MTB > Predict 'intv_ln';
SUBC>   Nodefault;
SUBC>   CPredictors hdsize;
SUBC>   PFITS 'PFITS';
SUBC>   Psefits 'PSEFITS';
SUBC>   CLIM 'CLIM' - 'CLIM_1';
SUBC>   PLIM 'PLIM' - 'PLIM_1'.
Prediction for intv_ln 

Note: The prediction menu does not work when the column for which
predictions are requested contains missing values. Therefore, the above
commands were run on the dataset excluding missing values for hdsize.


* Q5: R2
regress intvl_ln p_year hdsize
regress intvl_ln p_rct p_year hdsize

Results for: btb_episodes.csv
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_year' 'hdsize';
SUBC>   Terms C6 hdsize;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_year, hdsize 

Note: The listings above already displayed the R2-value for the second model.


Exercise 2:
-----------
* open the btb_episodes data set
use btb_episodes.dta, clear

MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’
Worksheet was saved on 07/01/2013

* compute natural log transformed outcome
generate intvl_ln=ln(intvl)

MTB > Name C7 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')


* Q2: -p_year- as sole predictor
regress intvl_ln p_year	/* continuous predictor */
regress intvl_ln i.p_year /* categorical predictor */

MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_year';
SUBC>   Terms C6;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_year 

MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Categorical 'p_year';
SUBC>   Terms C6;
SUBC>   Constant;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_year 

Note: In order to get the model parametrized in the same way as in Stata
with the first category as the reference level, one needs to have the
"Coding for categorical predictors" as (1,0), in the Coding submenu; this
is the default in the Regression menu, but not in the General Linear Model 
menu. With the alternative coding (-1,0,+1), the parameters will be 
determined from the restriction that the sum of all parameters across 
categories equals 0.

* improving interpretability of intercept
generate pyear_1989=p_year-1989
regress intvl_ln pyear_1989	

MTB > name C8 'pyear_1989'
MTB > Let 'pyear_1989' = 'p_year'-1989
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'pyear_1989';
SUBC>   Terms C8;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   TExpand;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus pyear_1989 

Analysis of Variance
Source           DF  Seq SS  Contribution  Adj SS   Adj MS  F-Value  P-Value
Regression        1   147.5         4.19%   147.5  147.547   131.00    0.000
  pyear_1989      1   147.5         4.19%   147.5  147.547   131.00    0.000
Error          2998  3376.8        95.81%  3376.8    1.126
  Lack-of-Fit    17   112.8         3.20%   112.8    6.638     6.06    0.000
  Pure Error   2981  3264.0        92.61%  3264.0    1.095
Total          2999  3524.4       100.00%

Model Summary
      S   R-sq  R-sq(adj)    PRESS  R-sq(pred)
1.06130  4.19%      4.15%  3380.97       4.07%

Coefficients
Term            Coef  SE Coef         95% CI         T-Value  P-Value   VIF
Constant      7.1440   0.0294  (  7.0863,   7.2018)   242.58    0.000
pyear_1989  -0.04784  0.00418  (-0.05604, -0.03965)   -11.45    0.000  1.00

Regression Equation
intv_ln = 7.1440 - 0.04784 pyear_1989


* manual computation of lack of fit test
scalar F=(3376.80672-3263.96797)/(2998-2981)/1.09492384
display "lack-of-fit F-test:  F=" F "   P-value=" Ftail(2998-2981,2981,F)

The lack of fit test was already computed above (listed in the ANOVA
table). The P-value could also be computed manually in Minitab as

MTB > CDF 6.0621326;
SUBC>   F 17 2981.
Cumulative Distribution Function 

F distribution with 17 DF in numerator and 2981 DF in denominator

      x  P( X <= x )
6.06213            1


* Q3: linearity of -p_year- vs -intvl_ln- relationship
twoway (scatter intvl_ln p_year, msize(vsmall)) (lowess intvl_ln p_year) 
generate pyear_sq = p_year^2
regress intvl_ln p_year pyear_sq

MTB > Plot 'intvl_ln'*'p_year';
SUBC>   Symbol;
SUBC>   Lowess.
Scatterplot of intvl_ln vs p_year 
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_year';
SUBC>   Terms C6 C6*C6;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_year 

Note: the quadratic term for p_year must be created in the Model
submenu.


* Q4: evaluating collinearity (VIFs)
regress intvl_ln p_year pyear_sq
estat vif
generate pyear_ct=p_year-1999
replace pyear_sq=pyear_ct^2
regress intvl_ln pyear_ct pyear_sq
estat vif

MTB > name C9 'pyear_ct'
MTB > Let 'pyear_ct' = 'p_year'-1999
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'pyear_ct';
SUBC>   Terms C9 C9*C9;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus pyear_ct 

Note: see above for the VIF in the uncentered model.


* Q5: evaluating confounding
regress intvl_ln hdsize	/* model without year */
regress intvl_ln pyear_ct pyear_sq hdsize /* model with year */
display "percent change=" abs(-.0002863-(-.0010355))/abs(-.0010355)*100
* added assessment of association hdsize vs year (should really adjust for herd)
regress hdsize p_year

MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'hdsize';
SUBC>   Terms hdsize;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus hdsize 
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'hdsize' 'pyear_ct';
SUBC>   Terms hdsize C9 C9*C9;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus hdsize, pyear_ct 
MTB > Regress;
SUBC>   Response 'hdsize';
SUBC>   Nodefault;
SUBC>   Continuous 'p_year';
SUBC>   Terms C6;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: hdsize versus p_year 


* Q6: interaction
egen pyear_c2=cut(p_year), at(0 1999 2999) icodes
tab pyear_c2 p_year /* just checking */
regress intvl_ln pyear_c2##c.p_rct

MTB > name c12 'pyear_c2'
MTB > Code (0:1998) 0 (1999:2999) 1 'p_year'  'pyear_c2'
MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct';
SUBC>   Categorical 'pyear_c2';
SUBC>   Terms C5 C10 C5*C10;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_rct, pyear_c2 

Note: the interaction between the two predictors must be created in the 
Model submenu.


* Q7: model building 
regress intvl_ln hdsize pyear_c2##c.p_rct
margins  pyear_c2, at(hdsize=50 p_rct=(0(10)40)) expression(exp(predict(xb)))
marginsplot

MTB > Regress;
SUBC>   Response 'intv_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'hdsize';
SUBC>   Categorical 'pyear_c2';
SUBC>   Terms C5 hdsize C10 C5*C10;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intv_ln versus p_rct, hdsize, pyear_c2 
MTB > Name c11 "newyear"
MTB > Set 'newyear'
DATA>   5( 0 : 1 / 1 )1
DATA>   End.
MTB > Name c12 "newrct"
MTB > Set 'newrct'
DATA>   1( 0 : 40 / 10 )2
DATA>   End.
MTB > Name c13 "newhds"
MTB > Set 'newhds'
DATA>   1( 50 : 50 / 1 )10
DATA>   End.
MTB > Name C14 "PFITS" C15 "PSEFITS" C16 "CLIM" C17 "CLIM_1" C18 "PLIM" C19 "PLIM_1".
MTB > Predict 'intv_ln';
SUBC>   Nodefault;
SUBC>   CPredictors newrct newhds newyear;
SUBC>   PFITS 'PFITS';
SUBC>   Psefits 'PSEFITS';
SUBC>   CLIM 'CLIM' - 'CLIM_1';
SUBC>   PLIM 'PLIM' - 'PLIM_1'.
Prediction for intv_ln 

This produces columns with predicted values as well as lower and upper
bounds for confidence and prediction intervals for the sets of predictor
values specified in columns c11-c13. These can be plotted on transformed 
scale or with some further manipulation on original scale.


Exercise 3:
-----------
* open the btb_episodes data set
use btb_episodes_data.dta, clear
* compute extra variables needed
generate intvl_ln=ln(intvl)
generate pyear_ct=p_year-2000
generate pyear_sq=pyear_ct^2
sort herd p_year
gen id=_n

MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’
Worksheet was saved on 07/01/2013

MTB > Sort;
SUBC>   By 'herd' 'p_year';
SUBC>   Original.
MTB > Name c7 "id"
MTB > Set 'id'
DATA>   1( 1 : 3000 / 1 )1
DATA>   End.
MTB > name C8 'intvl_ln'
MTB > Let 'intvl_ln' = ln('intvl')
MTB > name C9 'pyear_ct'
MTB > Let 'pyear_ct' = 'p_year'-2000
MTB > name C10 'pyear_sq'
MTB > Let 'pyear_sq' = 'pyear_ct'**2

* run model
regress intvl_ln p_rct hdsize pyear_ct pyear_sq

MTB > Regress;
SUBC>   Response 'intvl_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   Terms C5 hdsize C9 C10;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 


* Q1: evaluate homoscedasticity
estat hettest
estat imtest
* residual plot using standardised residuals
predict stdres, rstandard
predict fit, xb
scatter stdres fit

MTB > Name C11 "SRES".
MTB > Regress;
SUBC>   Response 'intvl_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   Terms C5 hdsize C9 C10;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Rtype 2;
SUBC>  GFOURPACK;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Sresiduals 'SRES'.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 

Note: Minitab does not offer tests for heteroscedasticity; this is no
big deficit because such tests are often less useful than graphical tools
based on the residuals.


* Q2: evaluate normality
* first graphically
histogram stdres, normal
qnorm stdres
summarize stdres, detail
* then with a formal test
swilk stdres

MTB > GSummary 'SRES'.
Summary Report for SRES 
MTB > NormTest 'SRES';
SUBC>   RJTest.
Probability Plot of SRES 

Note: The normal plot was produced above. Normal plots in Minitab differ
from those in Stata by having the y-axis and x-axis switched. Also,
Minitab does not offer Shapiro-Wilks test for normality, but the Ryan-Joiner 
test is "similar". P-values for normality tests on residuals can never 
be trusted exactly anyway because they require independent observations,
and residuals are not independent.


* Q3: attempts to correct the overall fit
* starting with the original -intvl- variable
* use Box-Cox analysis to identify the optimal transformation
boxcox intvl p_rct hdsize pyear_ct pyear_sq
...

MTB > Name C12 "SRES_1".
MTB > Regress;
SUBC>   Response 'intvl';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   Terms C5 hdsize C9 C10;
SUBC>   Constant;
SUBC>   Boxcox;
SUBC>   Unstandardized;
SUBC>   Rtype 2;
SUBC>  GFOURPACK;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Sresiduals 'SRES_1'.
Regression Analysis: intvl versus p_rct, hdsize, pyear_ct, pyear_sq 

Method
Rows unused             13
Box-Cox transformation
Rounded ?               0.164565
Estimated ?             0.164565
95% CI for ?            (0.130065, 0.199065)

Analysis of Variance for Transformed Response
Source           DF   Adj SS   Adj MS  F-Value  P-Value
Regression        4   65.259  16.3147    58.53    0.000
  p_rct           1    4.164   4.1638    14.94    0.000
  hdsize          1    1.198   1.1978     4.30    0.038
  pyear_ct        1   47.622  47.6220   170.84    0.000
  pyear_sq        1   17.183  17.1829    61.64    0.000
Error          2982  831.221   0.2787
  Lack-of-Fit  2434  665.365   0.2734     0.90    0.940
  Pure Error    548  165.855   0.3027
Total          2986  896.479

Model Summary for Transformed Response
       S   R-sq  R-sq(adj)  R-sq(pred)
0.527964  7.28%      7.16%       6.99%

Coefficients for Transformed Response
Term           Coef   SE Coef  T-Value  P-Value   VIF
Constant     3.0515    0.0223   136.97    0.000
p_rct       0.00680   0.00176     3.86    0.000  1.17
hdsize    -0.000462  0.000223    -2.07    0.038  1.20
pyear_ct   -0.05449   0.00417   -13.07    0.000  4.00
pyear_sq  -0.003734  0.000476    -7.85    0.000  3.99

Regression Equation
intvl^0.164565 = 3.0515 + 0.00680 p_rct - 0.000462 hdsize - 0.05449 pyear_ct
                 - 0.003734 pyear_sq
Residual Plots for intvl  

Note: The Box-Cox analysis results in a choice of the "preferable"
transformation, and a model is then fitted for the transformed variable.
In this case, the choice did not involve any rounding off (of lambda),
so the listing includes the strictly optimal transformation.

MTB > GSummary  'SRES_1'.
Summary Report for SRES_1 
MTB > NormTest 'SRES_1';
SUBC>   RJTest.
Probability Plot of SRES_1 


* Q4: diagnostics
* refit model
regress intvl_ln p_rct hdsize pyear_ct pyear_sq
* examine detailed aspects of fit for the original model
* generate necessary residuals and fit statistics
capture drop fit stdres
predict fit, xb
predict stdres, rstandard
predict delres, rstudent
predict cook, cooksd
predict lev, leverage
predict dfit, dfit

MTB > Name C13 "SRES_2" C14 "TRES" C15 "HI" C16 "COOK" C17 "DFIT".
MTB > Regress;
SUBC>   Response 'intvl_ln';
SUBC>   Nodefault;
SUBC>   Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq';
SUBC>   Terms C5 hdsize C9 C10;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Rtype 2;
SUBC>  GFOURPACK;
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   Sresiduals 'SRES_2';
SUBC>   Tresiduals 'TRES';
SUBC>   Hi 'HI';
SUBC>   Cookd 'COOK';
SUBC>   Dfits 'DFIT'.
Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq 

Note: Minitab does not offer dfbeta statistics. The list of "extreme"
residuals (defined as abs(stdres)>=2) and leverages has been turned off
because for a large dataset, the list becomes very long and mostly a
nuisance.

* largest outlier(s)
summarize stdres, d
sort stdres
list id intvl_ln p_rct hdsize p_year fit stdres delres if abs(stdres)>2.7 & stdres~=., noobs

MTB > Describe 'SRES_2' 'TRES' 'HI' 'COOK' 'DFIT';
SUBC>   Mean;
SUBC>   StDeviation;
SUBC>   Median;
SUBC>   Minimum;
SUBC>   Maximum;
SUBC>   N.
Descriptive Statistics: SRES_2, TRES, HI, COOK, DFIT 

Note: Minitab does not allow easy listing of observations subject to
certain conditions. One option is to subset the dataset suitably, e.g. as
follows:

MTB > Subset;
SUBC>   Where "abs('SRES3')>2.5  And abs('SRES3') <> '*'";
SUBC>   Name "Subset of hs02_6.csv";
SUBC>   NoMatrices;
SUBC>   NoConstants;
SUBC>   Include.
Subset worksheet Subset of hs02_6.csv created.

* outlier test based on deletion residuals
display 2*nobs*ttail(nobs-nparam-1, 2.967837) /* P=1, so NS */
display invttail(nobs-nparam-1,.025/nobs)

Note: These tests can be computed semi-manually in Minitab as well. To
illustrate, we compute the cut-off for significance of the test (using
DF=2987-6=2981 and p=0.025/2987=0.00000837)

MTB > InvCDF .00000837;
SUBC>   T 2981.
Inverse Cumulative Distribution Function 

* compute threshold values
display "leverage cutoff:  " 2*nparam/nobs
display "conservative leverage cutoff:  " 3*nparam/nobs
display "Cook's D cutoff:   " 4/nobs
display "DFITS cutoff:   " 2*sqrt(nparam/nobs)*(nobs>=120)+1*(nobs<120)
display "DFBETA cutoff:   " 2/sqrt(nobs)*(nobs>=120)+1*(nobs<120)

MTB > let k1=2*5/2987
MTB > name k1 'thres_lev'
MTB > let k2=3*5/2987
MTB > name k2 'thres2_lev'
MTB > let k3=4/2987
MTB > name k3 'thres_cookd'
MTB > let k4=2*sqrt(5/2987)*(2987>=120)+(2987<120)
MTB > Print 'thres_lev' 'thres2_lev' 'thres_cookd' 'thres_dfits'.
Data Display 

Note: One could also introduce constants 'nobs' and 'nparam' and use
those in the formulae.

* refit the model two ways
* first, leave out cooksd > 0.007 (n=5)
regress intvl_ln p_rct hdsize pyear_ct pyear_sq if cook<=0.007
* R2 goes up slightly, coef. don't change
* second, leave out herds with high values of p_rct
regress intvl_ln p_rct hdsize pyear_ct pyear_sq if p_rct<=50
* very little change in model

* ADVANCED MATERIAL - suggest using robust standard errors (Chapter 20)
* to counteract problems of heteroscedasticity and non-normality
regress intvl_ln p_rct hdsize pyear_ct pyear_sq, robust

Note: Minitab does not offer robust standard errors, but this is indeed
an advanced topic and such procedures should be used with caution.