Solution file for additional exercise 2.6 ----------------------------------------- The purpose of this solution file is to demonstrate and comment on Minitab commands that correspond to the Stata solutions presented in VHM 812 for linear regression exercises 1-3. Output is only included for demonstration of differences between Stata and Minitab. Exercise 1: ----------- * open the btb_episodes data set use btb_episodes.dta, clear MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’ Worksheet was saved on 07/01/2013 * Q1a: distribution of outcome -intvl- codebook intvl histogram intvl MTB > GSummary 'intvl'. Summary for intvl * Q1b: natural log transformed outcome generate intvl_ln=ln(intvl) histogram intvl_ln MTB > Name C7 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') MTB > GSummary 'intvl_ln'. Summary for intvl_ln * Q2: simple linear regressions regress intvl_ln p_rct regress intvl_ln p_year regress intvl_ln hdsize MTB > Fitline 'intv_ln' 'p_rct'; SUBC> Confidence 95.0. Regression Analysis: intv_ln versus p_rct MTB > Fitline 'intv_ln' 'p_year'; SUBC> Confidence 95.0. Regression Analysis: intv_ln versus p_year MTB > Fitline 'intv_ln' 'hdsize'; SUBC> Confidence 95.0. Regression Analysis: intv_ln versus hdsize * Q3: multiple linear regression regress intvl_ln p_rct p_year hdsize MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'p_year' 'hdsize'; SUBC> Terms C5 C6 hdsize; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_rct, p_year, hdsize * added calculation of correlation between p_rct and hdsize pwcorr p_rct hdsize MTB > Correlation 'p_rct' 'hdsize'. Correlations: p_rct, hdsize regress intvl_ln hdsize estimates store smpl regress intvl_ln p_rct p_year hdsize estimates store mltpl estimates table smpl mltpl Note: It is possible in Minitab to store estimates in a column in the worksheet, but any further processing needs to be done manually. * Q4: predicion intervals for means and individuals (based only on hdsize) regress intvl_ln hdsize * doing the calculations yourself predict pv, xb predict pv_mean_se, stdp scalar tstar=invttail(2985,.025) /* using DFE */ generate pv_mean_u=pv + tstar*pv_mean_se generate pv_mean_l=pv - tstar*pv_mean_se predict pv_ind_se, stdf generate pv_ind_u=pv + tstar*pv_ind_se generate pv_ind_l=pv - tstar*pv_ind_se twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize) /// (line pv_mean_u hdsize) (line pv_mean_l hdsize) /// (line pv_ind_u hdsize) (line pv_ind_l hdsize) * using the built in graphics capabilities (only simple linear regression) twoway (scatter intvl_ln hdsize, sort msize(vsmall)) /// (lfitci intvl_ln hdsize, ciplot(rline)) /// (lfitci intvl_ln hdsize, stdf ciplot(rline)) MTB > Fitline 'intvl_ln' 'hdsize'; SUBC> Confidence 95.0; SUBC> Ci; SUBC> Pi. Regression Analysis: intvl_ln versus hdsize MTB > Subset; SUBC> Exclude; SUBC> MExclude; SUBC> Where "hdsize EQ miss()"; SUBC> Name "hdsize equals (Missing) excluded". Subset worksheet hdsize equals (Missing) excluded created. Results for: hdsize equals (Missing) excluded MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'hdsize'; SUBC> Terms hdsize; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus hdsize MTB > Name C8 "PFITS" C9 "PSEFITS" C10 "CLIM" C11 "CLIM_1" C12 "PLIM" C13 "PLIM_1". MTB > Predict 'intv_ln'; SUBC> Nodefault; SUBC> CPredictors hdsize; SUBC> PFITS 'PFITS'; SUBC> Psefits 'PSEFITS'; SUBC> CLIM 'CLIM' - 'CLIM_1'; SUBC> PLIM 'PLIM' - 'PLIM_1'. Prediction for intv_ln Note: The prediction menu does not work when the column for which predictions are requested contains missing values. Therefore, the above commands were run on the dataset excluding missing values for hdsize. * Q5: R2 regress intvl_ln p_year hdsize regress intvl_ln p_rct p_year hdsize Results for: btb_episodes.csv MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_year' 'hdsize'; SUBC> Terms C6 hdsize; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_year, hdsize Note: The listings above already displayed the R2-value for the second model. Exercise 2: ----------- * open the btb_episodes data set use btb_episodes.dta, clear MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’ Worksheet was saved on 07/01/2013 * compute natural log transformed outcome generate intvl_ln=ln(intvl) MTB > Name C7 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') * Q2: -p_year- as sole predictor regress intvl_ln p_year /* continuous predictor */ regress intvl_ln i.p_year /* categorical predictor */ MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_year'; SUBC> Terms C6; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_year MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Categorical 'p_year'; SUBC> Terms C6; SUBC> Constant; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_year Note: In order to get the model parametrized in the same way as in Stata with the first category as the reference level, one needs to have the "Coding for categorical predictors" as (1,0), in the Coding submenu; this is the default in the Regression menu, but not in the General Linear Model menu. With the alternative coding (-1,0,+1), the parameters will be determined from the restriction that the sum of all parameters across categories equals 0. * improving interpretability of intercept generate pyear_1989=p_year-1989 regress intvl_ln pyear_1989 MTB > name C8 'pyear_1989' MTB > Let 'pyear_1989' = 'p_year'-1989 MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'pyear_1989'; SUBC> Terms C8; SUBC> Constant; SUBC> Unstandardized; SUBC> TExpand; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus pyear_1989 Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value Regression 1 147.5 4.19% 147.5 147.547 131.00 0.000 pyear_1989 1 147.5 4.19% 147.5 147.547 131.00 0.000 Error 2998 3376.8 95.81% 3376.8 1.126 Lack-of-Fit 17 112.8 3.20% 112.8 6.638 6.06 0.000 Pure Error 2981 3264.0 92.61% 3264.0 1.095 Total 2999 3524.4 100.00% Model Summary S R-sq R-sq(adj) PRESS R-sq(pred) 1.06130 4.19% 4.15% 3380.97 4.07% Coefficients Term Coef SE Coef 95% CI T-Value P-Value VIF Constant 7.1440 0.0294 ( 7.0863, 7.2018) 242.58 0.000 pyear_1989 -0.04784 0.00418 (-0.05604, -0.03965) -11.45 0.000 1.00 Regression Equation intv_ln = 7.1440 - 0.04784 pyear_1989 * manual computation of lack of fit test scalar F=(3376.80672-3263.96797)/(2998-2981)/1.09492384 display "lack-of-fit F-test: F=" F " P-value=" Ftail(2998-2981,2981,F) The lack of fit test was already computed above (listed in the ANOVA table). The P-value could also be computed manually in Minitab as MTB > CDF 6.0621326; SUBC> F 17 2981. Cumulative Distribution Function F distribution with 17 DF in numerator and 2981 DF in denominator x P( X <= x ) 6.06213 1 * Q3: linearity of -p_year- vs -intvl_ln- relationship twoway (scatter intvl_ln p_year, msize(vsmall)) (lowess intvl_ln p_year) generate pyear_sq = p_year^2 regress intvl_ln p_year pyear_sq MTB > Plot 'intvl_ln'*'p_year'; SUBC> Symbol; SUBC> Lowess. Scatterplot of intvl_ln vs p_year MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_year'; SUBC> Terms C6 C6*C6; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_year Note: the quadratic term for p_year must be created in the Model submenu. * Q4: evaluating collinearity (VIFs) regress intvl_ln p_year pyear_sq estat vif generate pyear_ct=p_year-1999 replace pyear_sq=pyear_ct^2 regress intvl_ln pyear_ct pyear_sq estat vif MTB > name C9 'pyear_ct' MTB > Let 'pyear_ct' = 'p_year'-1999 MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'pyear_ct'; SUBC> Terms C9 C9*C9; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus pyear_ct Note: see above for the VIF in the uncentered model. * Q5: evaluating confounding regress intvl_ln hdsize /* model without year */ regress intvl_ln pyear_ct pyear_sq hdsize /* model with year */ display "percent change=" abs(-.0002863-(-.0010355))/abs(-.0010355)*100 * added assessment of association hdsize vs year (should really adjust for herd) regress hdsize p_year MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'hdsize'; SUBC> Terms hdsize; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus hdsize MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'hdsize' 'pyear_ct'; SUBC> Terms hdsize C9 C9*C9; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus hdsize, pyear_ct MTB > Regress; SUBC> Response 'hdsize'; SUBC> Nodefault; SUBC> Continuous 'p_year'; SUBC> Terms C6; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: hdsize versus p_year * Q6: interaction egen pyear_c2=cut(p_year), at(0 1999 2999) icodes tab pyear_c2 p_year /* just checking */ regress intvl_ln pyear_c2##c.p_rct MTB > name c12 'pyear_c2' MTB > Code (0:1998) 0 (1999:2999) 1 'p_year' 'pyear_c2' MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct'; SUBC> Categorical 'pyear_c2'; SUBC> Terms C5 C10 C5*C10; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_rct, pyear_c2 Note: the interaction between the two predictors must be created in the Model submenu. * Q7: model building regress intvl_ln hdsize pyear_c2##c.p_rct margins pyear_c2, at(hdsize=50 p_rct=(0(10)40)) expression(exp(predict(xb))) marginsplot MTB > Regress; SUBC> Response 'intv_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'hdsize'; SUBC> Categorical 'pyear_c2'; SUBC> Terms C5 hdsize C10 C5*C10; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intv_ln versus p_rct, hdsize, pyear_c2 MTB > Name c11 "newyear" MTB > Set 'newyear' DATA> 5( 0 : 1 / 1 )1 DATA> End. MTB > Name c12 "newrct" MTB > Set 'newrct' DATA> 1( 0 : 40 / 10 )2 DATA> End. MTB > Name c13 "newhds" MTB > Set 'newhds' DATA> 1( 50 : 50 / 1 )10 DATA> End. MTB > Name C14 "PFITS" C15 "PSEFITS" C16 "CLIM" C17 "CLIM_1" C18 "PLIM" C19 "PLIM_1". MTB > Predict 'intv_ln'; SUBC> Nodefault; SUBC> CPredictors newrct newhds newyear; SUBC> PFITS 'PFITS'; SUBC> Psefits 'PSEFITS'; SUBC> CLIM 'CLIM' - 'CLIM_1'; SUBC> PLIM 'PLIM' - 'PLIM_1'. Prediction for intv_ln This produces columns with predicted values as well as lower and upper bounds for confidence and prediction intervals for the sets of predictor values specified in columns c11-c13. These can be plotted on transformed scale or with some further manipulation on original scale. Exercise 3: ----------- * open the btb_episodes data set use btb_episodes_data.dta, clear * compute extra variables needed generate intvl_ln=ln(intvl) generate pyear_ct=p_year-2000 generate pyear_sq=pyear_ct^2 sort herd p_year gen id=_n MTB > WOpen "H:\VHM\VHM802\Data_csv\btb_episodes.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\btb_episodes.csv’ Worksheet was saved on 07/01/2013 MTB > Sort; SUBC> By 'herd' 'p_year'; SUBC> Original. MTB > Name c7 "id" MTB > Set 'id' DATA> 1( 1 : 3000 / 1 )1 DATA> End. MTB > name C8 'intvl_ln' MTB > Let 'intvl_ln' = ln('intvl') MTB > name C9 'pyear_ct' MTB > Let 'pyear_ct' = 'p_year'-2000 MTB > name C10 'pyear_sq' MTB > Let 'pyear_sq' = 'pyear_ct'**2 * run model regress intvl_ln p_rct hdsize pyear_ct pyear_sq MTB > Regress; SUBC> Response 'intvl_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> Terms C5 hdsize C9 C10; SUBC> Constant; SUBC> Unstandardized; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq * Q1: evaluate homoscedasticity estat hettest estat imtest * residual plot using standardised residuals predict stdres, rstandard predict fit, xb scatter stdres fit MTB > Name C11 "SRES". MTB > Regress; SUBC> Response 'intvl_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> Terms C5 hdsize C9 C10; SUBC> Constant; SUBC> Unstandardized; SUBC> Rtype 2; SUBC> GFOURPACK; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Sresiduals 'SRES'. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq Note: Minitab does not offer tests for heteroscedasticity; this is no big deficit because such tests are often less useful than graphical tools based on the residuals. * Q2: evaluate normality * first graphically histogram stdres, normal qnorm stdres summarize stdres, detail * then with a formal test swilk stdres MTB > GSummary 'SRES'. Summary Report for SRES MTB > NormTest 'SRES'; SUBC> RJTest. Probability Plot of SRES Note: The normal plot was produced above. Normal plots in Minitab differ from those in Stata by having the y-axis and x-axis switched. Also, Minitab does not offer Shapiro-Wilks test for normality, but the Ryan-Joiner test is "similar". P-values for normality tests on residuals can never be trusted exactly anyway because they require independent observations, and residuals are not independent. * Q3: attempts to correct the overall fit * starting with the original -intvl- variable * use Box-Cox analysis to identify the optimal transformation boxcox intvl p_rct hdsize pyear_ct pyear_sq ... MTB > Name C12 "SRES_1". MTB > Regress; SUBC> Response 'intvl'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> Terms C5 hdsize C9 C10; SUBC> Constant; SUBC> Boxcox; SUBC> Unstandardized; SUBC> Rtype 2; SUBC> GFOURPACK; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Sresiduals 'SRES_1'. Regression Analysis: intvl versus p_rct, hdsize, pyear_ct, pyear_sq Method Rows unused 13 Box-Cox transformation Rounded ? 0.164565 Estimated ? 0.164565 95% CI for ? (0.130065, 0.199065) Analysis of Variance for Transformed Response Source DF Adj SS Adj MS F-Value P-Value Regression 4 65.259 16.3147 58.53 0.000 p_rct 1 4.164 4.1638 14.94 0.000 hdsize 1 1.198 1.1978 4.30 0.038 pyear_ct 1 47.622 47.6220 170.84 0.000 pyear_sq 1 17.183 17.1829 61.64 0.000 Error 2982 831.221 0.2787 Lack-of-Fit 2434 665.365 0.2734 0.90 0.940 Pure Error 548 165.855 0.3027 Total 2986 896.479 Model Summary for Transformed Response S R-sq R-sq(adj) R-sq(pred) 0.527964 7.28% 7.16% 6.99% Coefficients for Transformed Response Term Coef SE Coef T-Value P-Value VIF Constant 3.0515 0.0223 136.97 0.000 p_rct 0.00680 0.00176 3.86 0.000 1.17 hdsize -0.000462 0.000223 -2.07 0.038 1.20 pyear_ct -0.05449 0.00417 -13.07 0.000 4.00 pyear_sq -0.003734 0.000476 -7.85 0.000 3.99 Regression Equation intvl^0.164565 = 3.0515 + 0.00680 p_rct - 0.000462 hdsize - 0.05449 pyear_ct - 0.003734 pyear_sq Residual Plots for intvl Note: The Box-Cox analysis results in a choice of the "preferable" transformation, and a model is then fitted for the transformed variable. In this case, the choice did not involve any rounding off (of lambda), so the listing includes the strictly optimal transformation. MTB > GSummary 'SRES_1'. Summary Report for SRES_1 MTB > NormTest 'SRES_1'; SUBC> RJTest. Probability Plot of SRES_1 * Q4: diagnostics * refit model regress intvl_ln p_rct hdsize pyear_ct pyear_sq * examine detailed aspects of fit for the original model * generate necessary residuals and fit statistics capture drop fit stdres predict fit, xb predict stdres, rstandard predict delres, rstudent predict cook, cooksd predict lev, leverage predict dfit, dfit MTB > Name C13 "SRES_2" C14 "TRES" C15 "HI" C16 "COOK" C17 "DFIT". MTB > Regress; SUBC> Response 'intvl_ln'; SUBC> Nodefault; SUBC> Continuous 'p_rct' 'hdsize' 'pyear_ct' 'pyear_sq'; SUBC> Terms C5 hdsize C9 C10; SUBC> Constant; SUBC> Unstandardized; SUBC> Rtype 2; SUBC> GFOURPACK; SUBC> Tmethod; SUBC> Tanova; SUBC> Tsummary; SUBC> Tcoefficients; SUBC> Tequation; SUBC> Sresiduals 'SRES_2'; SUBC> Tresiduals 'TRES'; SUBC> Hi 'HI'; SUBC> Cookd 'COOK'; SUBC> Dfits 'DFIT'. Regression Analysis: intvl_ln versus p_rct, hdsize, pyear_ct, pyear_sq Note: Minitab does not offer dfbeta statistics. The list of "extreme" residuals (defined as abs(stdres)>=2) and leverages has been turned off because for a large dataset, the list becomes very long and mostly a nuisance. * largest outlier(s) summarize stdres, d sort stdres list id intvl_ln p_rct hdsize p_year fit stdres delres if abs(stdres)>2.7 & stdres~=., noobs MTB > Describe 'SRES_2' 'TRES' 'HI' 'COOK' 'DFIT'; SUBC> Mean; SUBC> StDeviation; SUBC> Median; SUBC> Minimum; SUBC> Maximum; SUBC> N. Descriptive Statistics: SRES_2, TRES, HI, COOK, DFIT Note: Minitab does not allow easy listing of observations subject to certain conditions. One option is to subset the dataset suitably, e.g. as follows: MTB > Subset; SUBC> Where "abs('SRES3')>2.5 And abs('SRES3') <> '*'"; SUBC> Name "Subset of hs02_6.csv"; SUBC> NoMatrices; SUBC> NoConstants; SUBC> Include. Subset worksheet Subset of hs02_6.csv created. * outlier test based on deletion residuals display 2*nobs*ttail(nobs-nparam-1, 2.967837) /* P=1, so NS */ display invttail(nobs-nparam-1,.025/nobs) Note: These tests can be computed semi-manually in Minitab as well. To illustrate, we compute the cut-off for significance of the test (using DF=2987-6=2981 and p=0.025/2987=0.00000837) MTB > InvCDF .00000837; SUBC> T 2981. Inverse Cumulative Distribution Function * compute threshold values display "leverage cutoff: " 2*nparam/nobs display "conservative leverage cutoff: " 3*nparam/nobs display "Cook's D cutoff: " 4/nobs display "DFITS cutoff: " 2*sqrt(nparam/nobs)*(nobs>=120)+1*(nobs<120) display "DFBETA cutoff: " 2/sqrt(nobs)*(nobs>=120)+1*(nobs<120) MTB > let k1=2*5/2987 MTB > name k1 'thres_lev' MTB > let k2=3*5/2987 MTB > name k2 'thres2_lev' MTB > let k3=4/2987 MTB > name k3 'thres_cookd' MTB > let k4=2*sqrt(5/2987)*(2987>=120)+(2987<120) MTB > Print 'thres_lev' 'thres2_lev' 'thres_cookd' 'thres_dfits'. Data Display Note: One could also introduce constants 'nobs' and 'nparam' and use those in the formulae. * refit the model two ways * first, leave out cooksd > 0.007 (n=5) regress intvl_ln p_rct hdsize pyear_ct pyear_sq if cook<=0.007 * R2 goes up slightly, coef. don't change * second, leave out herds with high values of p_rct regress intvl_ln p_rct hdsize pyear_ct pyear_sq if p_rct<=50 * very little change in model * ADVANCED MATERIAL - suggest using robust standard errors (Chapter 20) * to counteract problems of heteroscedasticity and non-normality regress intvl_ln p_rct hdsize pyear_ct pyear_sq, robust Note: Minitab does not offer robust standard errors, but this is indeed an advanced topic and such procedures should be used with caution.