Solution file for additional exercise 10.7 ------------------------------------------ Data on excretion of salsolinol measured over 4 consecutive days, for alcoholized persons, divided into two groups based on degree of addiction. Group 1 contains 6 persons, and group 2 contains 8 persons. Note that the datafile codes persons as 1-6 in group 1 and 1-8 in group 2; you might want to create unique person id's although that is not strictly required for analysis in Minitab. These are repeated measures or longitudinal data because measurements are taken on the same person (experimental unit), and in this case over time. Notation: y_ijk = salsolinol value for patient j in patient group i measured at day k, i = 1,2 (group), j = 1,..., n_i (patient; n_1=6, n_2=8), k = 1,2,3,4 (day), The first step is to construct profile plots over time for each person. MTB > WOpen "H:\VHM\VHM802\Data_csv\hs10_7.csv"; SUBC> FType; SUBC> CSV; SUBC> DecSep; SUBC> Period; SUBC> Field; SUBC> Comma; SUBC> TDelimiter; SUBC> DoubleQuote. Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs10_7.csv’ Worksheet was saved on 03/03/2011 MTB > Plot 'salsolinol'*'day'; SUBC> Symbol 'person'; SUBC> Connect 'person'; SUBC> Panel 'group'. Scatterplot of salsolinol vs day Comments: --------- The plot seems to indicate more variability for higher values of the outcome than for lower values. To confirm this first impression, we fit a split-plot model with persons as replicates within groups, and look at the residuals. MTB > GLM; SUBC> Response 'salsolinol'; SUBC> Nodefault; SUBC> Categorical 'group' 'person' 'day'; SUBC> Nest person(group); SUBC> Random person; SUBC> Terms group day person group*day; SUBC> TExpand; SUBC> TMethod; SUBC> TAnova; SUBC> TSummary; SUBC> TCoefficients; SUBC> TFactor; SUBC> TEMS; SUBC> TVariance; SUBC> TDiagnostics 0; SUBC> Rtype 2; SUBC> GFOURPACK. General Linear Model: salsolinol versus group, person, day Method Factor coding (-1, 0, +1) Factor Information Factor Type Levels Values group Fixed 2 1, 2 person(group) Random 14 1(1), 2(1), 3(1), 4(1), 5(1), 6(1), 1(2), 2(2), 3(2), 4(2), 5(2), 6(2), 7(2), 8(2) day Fixed 4 1, 2, 3, 4 Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value group 1 6.565 2.79% 6.565 6.565 1.95 0.188 day 3 14.888 6.32% 11.723 3.908 0.85 0.476 person(group) 12 40.329 17.12% 40.329 3.361 0.73 0.713 group*day 3 8.029 3.41% 8.029 2.676 0.58 0.631 Error 36 165.692 70.36% 165.692 4.603 Total 55 235.503 100.00% Model Summary S R-sq R-sq(adj) PRESS R-sq(pred) 2.14536 29.64% 0.00% 398.019 0.00% Fits and Diagnostics for Unusual Observations Std Del Obs salsolinol Fit SE Fit 95% CI Resid Resid Resid HI Cook’s D DFITS 22 6.34 2.82 1.31 (0.16, 5.49) 3.52 2.07 2.18 0.37500 0.13 1.68713 R 45 7.80 2.76 1.26 (0.21, 5.32) 5.04 2.90 3.26 0.34375 0.22 2.36129 R 56 8.10 3.77 1.26 (1.22, 6.32) 4.33 2.49 2.70 0.34375 0.16 1.95545 R R Large residual Expected Mean Squares, using Adjusted SS Expected Mean Square for Source Each Term 1 group (5) + 4.0000 (3) + Q[1, 4] 2 day (5) + Q[2, 4] 3 person(group) (5) + 4.0000 (3) 4 group*day (5) + Q[4] 5 Error (5) Variance Components, using Adjusted SS Source Variance % of Total StDev % of Total person(group) -0.310441* 0.00% 0.00000 0.00% Error 4.60255 100.00% 2.14536 100.00% Total 4.60255 2.14536 * Value is negative, and is estimated by zero. Residual Plots for salsolinol Comments: --------- The plot of residuals vs fitted values shows a clear cone or fan shape (increasing variation with increasing fitted values). Also, the residuals extend further on the positive than the negative side, and the histogram indicates a right-skewed distribution. We try a log-transformation (below), and the residuals look much better. It may of interest to explore which transformations are best for these data, but we will here work only with the logs. The boxcox command in Stata and the General Linear Model menu (below) are only for fixed effects models, but may nevertheless give an indication of a suitable transformation, and this analysis also suggests a log transformation, with an optimal lambda of -0.16 and hence fairly close to 0 and with 0 included in the 95% CI. MTB > GLM; SUBC> Response 'salsolinol'; SUBC> Nodefault; SUBC> Categorical 'group' 'person' 'day'; SUBC> Nest person(group); SUBC> Terms group day person group*day; SUBC> Boxcox; SUBC> TExpand; SUBC> TMethod; SUBC> TAnova; SUBC> TSummary; SUBC> TCoefficients; SUBC> TFactor; SUBC> TDiagnostics 0; SUBC> Rtype 2; SUBC> GFOURPACK. General Linear Model: salsolinol versus group, person, day Box-Cox transformation Rounded ? 0 Estimated ? -0.155716 95% CI for ? (-0.466216, 0.153784) ... MTB > Name C5 'lnsals' MTB > Let 'lnsals' = ln('salsolinol') MTB > Plot 'lnsals'*'day'; SUBC> Symbol 'person'; SUBC> Connect 'person'; SUBC> Panel 'group'. Scatterplot of lnsals vs day Comments: --------- The profiles now appear equally noisy across the range of lnsals values. There are no obvious person effects, and nor are average effects of group or day clearly seen. The results for the split-plot model are shown further below; the model checks look much better now. As our first analysis, we analyse the data separately for each day. This requires the data structure to be changed from long to wide. MTB > Unstack ('lnsals' 'group'); SUBC> Subscripts 'day'; SUBC> NewWS; SUBC> VarNames. Results for: Worksheet 2 MTB > OneWay; SUBC> Response 'lnsals_1'; SUBC> Categorical 'group_1'; SUBC> IType 0; SUBC> GMCI; SUBC> TMethod; SUBC> TFactor; SUBC> TANOVA; SUBC> TSummary; SUBC> TMeans; SUBC> Nodefault. One-way ANOVA: lnsals_1 versus group_1 Method Null hypothesis All means are equal Alternative hypothesis At least one mean is different Significance level a = 0.05 Equal variances were assumed for the analysis. Factor Information Factor Levels Values group_1 2 1, 2 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value group_1 1 0.1738 0.1738 0.15 0.708 Error 12 14.2197 1.1850 Total 13 14.3935 Model Summary S R-sq R-sq(adj) R-sq(pred) 1.08856 1.21% 0.00% 0.00% Means group_1 N Mean StDev 95% CI 1 6 -0.110 1.186 (-1.078, 0.859) 2 8 0.116 1.013 (-0.723, 0.954) Pooled StDev = 1.08856 Results for time points 2-4, using similar commands: --------------------------------------------------- One-way ANOVA: lnsals_2 versus group_1 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value group_1 1 0.06873 0.06873 0.09 0.765 Error 12 8.79155 0.73263 Total 13 8.86028 S R-sq R-sq(adj) R-sq(pred) 0.855938 0.78% 0.00% 0.00% Means group_1 N Mean StDev 95% CI 1 6 0.103 1.046 (-0.659, 0.864) 2 8 0.244 0.689 (-0.415, 0.904) Pooled StDev = 0.855938 One-way ANOVA: lnsals_3 versus group_1 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value group_1 1 0.4618 0.4618 0.76 0.399 Error 12 7.2593 0.6049 Total 13 7.7211 S R-sq R-sq(adj) R-sq(pred) 0.777781 5.98% 0.00% 0.00% Means group_1 N Mean StDev 95% CI 1 6 0.356 0.705 (-0.336, 1.047) 2 8 0.723 0.826 ( 0.123, 1.322) Pooled StDev = 0.777781 One-way ANOVA: lnsals_4 versus group_1 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value group_1 1 1.214 1.2142 1.79 0.206 Error 12 8.140 0.6783 Total 13 9.354 S R-sq R-sq(adj) R-sq(pred) 0.823606 12.98% 5.73% 0.00% Means group_1 N Mean StDev 95% CI 1 6 0.441 0.789 (-0.291, 1.174) 2 8 1.036 0.848 ( 0.402, 1.671) Pooled StDev = 0.823606 Comments: --------- The 4 separate analyses for each day show no significant differences between the two groups. Only the last one (day 4) gets even close to something interesting. There is no sign of problems with heterogeneous variances. Instead of the one-way ANOVA, we could also have used 2-sample t-tests. The profile plot (above) does not suggest any obvious response features. We therefore drop this part here. The split-plot model may be appropriate for the relatively short series of 4 measures per person. It takes * persons = whole plots, * groups = whole plot factor, * dayly measurements = sub-plots, * day = sub-plot factor, * no blocks - the persons are replications within groups. Another way of say this is that the data may be viewed as having hierarchical structure with measurements within persons. The statistical model is y_ijk = mu + alpha_i + (AB)_ij + gamma_k + (alpha gamma)_ik + eps_ijk, where AB_ij's are assumed i.i.d. N(0,sigma_AB^2), and where eps_ijk's are assumed i.i.d. N(0,sigma^2), Results for: hs10_7.csv MTB > Name C6 "SRES". MTB > GLM; SUBC> Response 'lnsals'; SUBC> Nodefault; SUBC> Categorical 'group' 'person' 'day'; SUBC> Nest person(group); SUBC> Random person; SUBC> Terms group day person group*day; SUBC> Means group day group*day; SUBC> TExpand; SUBC> TMethod; SUBC> TAnova; SUBC> TSummary; SUBC> TFactor; SUBC> TEMS; SUBC> TVariance; SUBC> TMeans; SUBC> TDiagnostics 0; SUBC> Rtype 2; SUBC> GFOURPACK; SUBC> SResiduals 'SRES'. General Linear Model: lnsals versus group, person, day Factor Information Factor Type Levels Values group Fixed 2 1, 2 person(group) Random 14 1(1), 2(1), 3(1), 4(1), 5(1), 6(1), 1(2), 2(2), 3(2), 4(2), 5(2), 6(2), 7(2), 8(2) day Fixed 4 1, 2, 3, 4 Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value group 1 1.5136 3.33% 1.5136 1.5136 2.48 0.141 day 3 5.0968 11.22% 4.6327 1.5442 1.79 0.167 person(group) 12 7.3142 16.10% 7.3142 0.6095 0.71 0.736 group*day 3 0.4050 0.89% 0.4050 0.1350 0.16 0.925 Error 36 31.0963 68.46% 31.0963 0.8638 Total 55 45.4258 100.00% S R-sq R-sq(adj) PRESS R-sq(pred) 0.929401 31.54% 0.00% 76.1600 0.00% Fits and Diagnostics for Unusual Observations Std Del Obs lnsals Fit SE Fit 95% CI Resid Resid Resid HI Cook’s D DFITS 5 1.668 0.141 0.569 (-1.013, 1.296) 1.526 2.08 2.18 0.37500 0.13 1.69134 R 22 1.847 0.296 0.569 (-0.858, 1.451) 1.551 2.11 2.22 0.37500 0.13 1.72182 R 45 2.054 0.531 0.545 (-0.574, 1.636) 1.523 2.02 2.12 0.34375 0.11 1.53372 R R Large residual Expected Mean Squares, using Adjusted SS Expected Mean Square for Source Each Term 1 group (5) + 4.0000 (3) + Q[1, 4] 2 day (5) + Q[2, 4] 3 person(group) (5) + 4.0000 (3) 4 group*day (5) + Q[4] 5 Error (5) Means Term Fitted Mean group 1 0.19750 2 0.52971 day 1 0.00297 2 0.17350 3 0.53913 4 0.73883 group*day 1 1 -0.10961 1 2 0.10271 1 3 0.35563 1 4 0.44128 2 1 0.11555 2 2 0.24429 2 3 0.72263 2 4 1.03638 Variance Components, using Adjusted SS Source Variance % of Total StDev % of Total person(group) -0.0635683* 0.00% 0.000000 0.00% Error 0.863786 100.00% 0.929401 100.00% Total 0.863786 0.929401 * Value is negative, and is estimated by zero. Residual Plots for lnsals MTB > FacPlot 'lnsals'; SUBC> Factors group day; SUBC> GInt; SUBC> Full. Interaction Plot for lnsals MTB > NormTest 'SRES'. Probability Plot of SRES The P-value for the Anderson-Darling test of normality is 0.211. MTB > Name c9 "ByVar1" c10 "ByVar2" c11 "Mean1" MTB > Statistics 'lnsals'; SUBC> By 'group' 'person'; SUBC> NoEmpty; SUBC> GValues 'ByVar1'-'ByVar2'; SUBC> Mean 'Mean1'. MTB > Name C10 "RESI". MTB > OneWay; SUBC> Response 'Mean1'; SUBC> Categorical 'ByVar1'; SUBC> IType 0; SUBC> GMCI; SUBC> GFourpack; SUBC> TMethod; SUBC> TFactor; SUBC> TANOVA; SUBC> TSummary; SUBC> TMeans; SUBC> Residuals 'RESI'; SUBC> Nodefault. One-way ANOVA: Mean1 versus ByVar1 Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value ByVar1 1 0.3784 0.3784 2.48 0.141 Error 12 1.8285 0.1524 Total 13 2.2069 S R-sq R-sq(adj) R-sq(pred) 0.390356 17.15% 10.24% 0.00% Means ByVar1 N Mean StDev 95% CI 1 6 0.198 0.273 (-0.150, 0.545) 2 8 0.530 0.456 ( 0.229, 0.830) Pooled StDev = 0.390356 Residual Plots for Mean1 MTB > NormTest 'RESI'. Probability Plot of RESI The P-value for the Anderson-Darling test of normality is 0.661. Comments: --------- The residuals look nice both when plotted against the fitted values and in the normal plot, and there are no alarmingly extreme values. This is true both for the sub-plot and the whole plot residuals, even if the latter seem to have somewhat more variation in group 2 than in group 1. The ANOVA table shows most effects to be insignificant. The whole-plot variation seems to be very small (negative estimated variance component which should therefore be set to 0), but we keep the whole-plots in the model anyway. The group*day interaction seems to be very small - presumably a disappointment for the researchers, because it would seem to be the most interesting effect, depending though on the interpretation of days (days after some event?). There are weak but clearly insignificant effects of groups and days. The estimated means show group 2 to have higher values, and the values to be increasing with days. The interaction plot shows a roughly linear relationship with days, with almost parallel curves. To study this further, we fit a model with a linear day effect MTB > GLM; SUBC> Response 'lnsals'; SUBC> Nodefault; SUBC> Continuous 'day'; SUBC> Categorical 'group' 'person'; SUBC> Unstandardized; SUBC> Nest person(group); SUBC> Random person; SUBC> Terms group day person day*group; SUBC> TExpand; SUBC> TMethod; SUBC> TAnova; SUBC> TSummary; SUBC> TFactor; SUBC> TEMS; SUBC> TVariance; SUBC> TMeans; SUBC> TDiagnostics 0; SUBC> Rtype 2. General Linear Model: lnsals versus day, group, person Factor Information Factor Type Levels Values group Fixed 2 1, 2 person(group) Random 14 1(1), 2(1), 3(1), 4(1), 5(1), 6(1), 1(2), 2(2), 3(2), 4(2), 5(2), 6(2), 7(2), 8(2) Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value group 1 1.5136 3.33% 0.0000 0.00001 0.00 0.998 x day 1 4.9850 10.97% 4.5404 4.54040 5.80 0.021 person(group) 12 7.3142 16.10% 7.3142 0.60951 0.78 0.668 day*group 1 0.3056 0.67% 0.3056 0.30565 0.39 0.536 Error 40 31.3075 68.92% 31.3075 0.78269 Total 55 45.4258 100.00% x Not an exact F-test. S R-sq R-sq(adj) PRESS R-sq(pred) 0.884697 31.08% 5.23% 63.4590 0.00% Fits and Diagnostics for Unusual Observations Std Del Obs lnsals Fit SE Fit 95% CI Resid Resid Resid HI Cook’s D DFITS 5 1.668 0.163 0.504 (-0.857, 1.182) 1.505 2.07 2.16 0.325000 0.13 1.50162 R 22 1.847 0.296 0.450 (-0.613, 1.205) 1.551 2.04 2.12 0.258333 0.09 1.25307 R 45 2.054 0.459 0.490 (-0.531, 1.448) 1.595 2.16 2.28 0.306250 0.13 1.51165 R R Large residual Expected Mean Squares, using Adjusted SS Expected Mean Square Source for Each Term 1 group (5) + 0.6667 (3) + Q[1] 2 day (5) + Q[2, 4] 3 person(group) (5) + 4.0000 (3) 4 day*group (5) + Q[4] 5 Error (5) Variance Components, using Adjusted SS Source Variance % of Total StDev % of Total person(group) -0.0432939* 0.00% 0.000000 0.00% Error 0.782688 100.00% 0.884697 100.00% Total 0.782688 0.884697 * Value is negative, and is estimated by zero. Comment: -------- In random effects models, two nested models cannot be compared as easily using an F-test as for usual linear models. If a statistical test is desired, one may use the more advanced theory for tests based on the likelihood function (mixed command in Stata or proc Mixed in SAS). In this case, however, it is quite clear that the linear day effect is very appropriate. It explains 4.985/5.097=98% of the variation between days, and most of the (very small) variation in day*group. There is no need of non-parallel slopes here, so we remove them from the model. MTB > GLM; SUBC> Response 'lnsals'; SUBC> Nodefault; SUBC> Continuous 'day'; SUBC> Categorical 'group' 'person'; SUBC> Unstandardized; SUBC> Nest person(group); SUBC> Random person; SUBC> Terms group day person; SUBC> Means group; SUBC> TExpand; SUBC> TMethod; SUBC> TAnova; SUBC> TSummary; SUBC> TFactor; SUBC> TEMS; SUBC> TVariance; SUBC> TMeans; SUBC> TDiagnostics 0; SUBC> Rtype 2. General Linear Model: lnsals versus day, group, person Factor Information Factor Type Levels Values group Fixed 2 1, 2 person(group) Random 14 1(1), 2(1), 3(1), 4(1), 5(1), 6(1), 1(2), 2(2), 3(2), 4(2), 5(2), 6(2), 7(2), 8(2) Analysis of Variance Source DF Seq SS Contribution Adj SS Adj MS F-Value P-Value group 1 1.514 3.33% 1.514 1.5136 2.48 0.141 day 1 4.985 10.97% 4.985 4.9850 6.47 0.015 person(group) 12 7.314 16.10% 7.314 0.6095 0.79 0.657 Error 41 31.613 69.59% 31.613 0.7711 Total 55 45.426 100.00% S R-sq R-sq(adj) PRESS R-sq(pred) 0.878096 30.41% 6.64% 59.8399 0.00% Fits and Diagnostics for Unusual Observations Std Del Obs lnsals Fit SE Fit 95% CI Resid Resid Resid HI Cook’s D DFITS 5 1.668 0.048 0.466 (-0.894, 0.990) 1.620 2.18 2.29 0.282143 0.12 1.43345 R 22 1.847 0.258 0.442 (-0.635, 1.151) 1.589 2.09 2.19 0.253571 0.10 1.27622 R 45 2.054 0.545 0.466 (-0.397, 1.487) 1.509 2.03 2.11 0.282143 0.11 1.32463 R 56 2.092 0.545 0.466 (-0.397, 1.487) 1.547 2.08 2.17 0.282143 0.11 1.36153 R R Large residual Expected Mean Squares, using Adjusted SS Expected Mean Square Source for Each Term 1 group (4) + 4.0000 (3) + Q[1] 2 day (4) + Q[2] 3 person(group) (4) + 4.0000 (3) 4 Error (4) Means Term Fitted Mean group 1 0.197503 2 0.529714 Data Covariate Mean StDev day 2.50 1.13 Variance Components, using Adjusted SS Source Variance % of Total StDev % of Total person(group) -0.0403851* 0.00% 0.000000 0.00% Error 0.771053 100.00% 0.878096 100.00% Total 0.771053 0.878096 * Value is negative, and is estimated by zero. Comments: --------- The resulting model has a significant regression coefficient for days: the ln-salsolinol levels increase by 0.27 units per day, or equivalently it can be said that the salsolinol levels increase by a factor of exp(0.27)=1.3 per day. There is perhaps an indication of higher levels in group 2 than in group 1, but it is non-significant (P=0.14). One assumption of the split-plot analysis that cannot easily be checked in Minitab is that the correlation within persons is independent of time. As the person-variability is estimated as very low, it means that observations on the same person are essentially independent. A repeated measures ANOVA or a full mixed model analysis is required to assess the assumption, see the SAS-program hs10_7.sas or the Stata do-file hs10_7.do.