The assignment is worth 15% of the final course mark. Please be aware that by handing in the home assignment you implicitly acknowledge that you have read and accepted the instructions for home assignments as described on the VHM 801 homepage. Recall that all solutions should be accompanied by text explaining the procedures used; in particular, all statistical models and assumptions should be specified. The level of detail provided should at least conform to the main principles of the recommendations for statistical reporting, as discussed in the course. Because the text of the assignment is rather long, it is suggested that you do not include the text itself among your answers.
This home assignment involves the analysis of two unrelated datasets. The questions pertaining to the first dataset account for roughly 10 of the 15 points of the assignment mark.
Study A
The data for this study were collected by Charles Caraguel as part of his PhD project in the Department of Health Management, AVC. The interest is in assessing the test characteristics of Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR) tests for Infectious Salmon Anemia Virus (ISAV) run at different laboratories and involving different preparations of the samples. Here we consider kidney samples of 95 fish from salmon aquaculture sites in New Brunswick.
The tissue samples from each fish were divided into several subsamples. One subsample was directly submitted to a laboratory for testing. Another subsample was preprocessed at AVC before it was submitted to the same lab for testing. The dataset described below therefore contains two subsamples derived from each fish. In addition to the results of the testing of the different subsamples, all the information obtained from each fish (including a lot of information not shown here) was combined into a single indicator of whether the fish was indeed infected with ISAV or not.
The meaning of the variables in the files is explained below. Note that for all variables a disease-positive status/result is indicated by a "1", and a disease-negative status/result by a "0". The data are available as datafiles in both a Minitab format and a comma-separated format, for import into Stata and other statistical software.
The RT-PCR test run on samples with different preparations (preprocessed or not) may be considered as two distinct versions of the same diagnostic test. For simplicity of wording we will here refer to them as two distinct diagnostic tests. Recall that the test sensitivity is the probability that the diagnostic test detects a truly diseased subject, whereas the test specificity is the probability that the diagnostic test gives a negative result for a truly non-diseased subject. Ideally a diagnostic test should have high values of both sensitivity and specificity, but this may be difficult to achieve in practice. Thus it is important to know these characteristics; for example, a test with a low specificity (say 0.80) will give a positive result for a fairly high proportion (in the example, 20%) of the truly non-diseased subjects, that is, a high rate of false positives.
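The two definitions above can be illustrated with a minimal Python sketch. The function name and the ten 0/1 codes are hypothetical and are not taken from the assignment datafiles; the sketch only assumes the coding described above (1 = positive, 0 = negative) and that both truly diseased and truly non-diseased subjects are present.

```python
def sensitivity_specificity(true_status, test_result):
    """Return (sensitivity, specificity) from paired 0/1 codes.

    true_status: 1 if the subject is truly diseased, 0 otherwise.
    test_result: 1 if the diagnostic test is positive, 0 otherwise.
    """
    pairs = list(zip(true_status, test_result))
    tp = sum(1 for t, r in pairs if t == 1 and r == 1)  # true positives
    fn = sum(1 for t, r in pairs if t == 1 and r == 0)  # false negatives
    tn = sum(1 for t, r in pairs if t == 0 and r == 0)  # true negatives
    fp = sum(1 for t, r in pairs if t == 0 and r == 1)  # false positives
    sensitivity = tp / (tp + fn)  # proportion of diseased subjects detected
    specificity = tn / (tn + fp)  # proportion of non-diseased testing negative
    return sensitivity, specificity


# Hypothetical codes for ten subjects (NOT the assignment data).
status = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
result = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

se, sp = sensitivity_specificity(status, result)
print(f"sensitivity = {se:.3f}, specificity = {sp:.3f}")
```

These are point estimates only; the assignment still requires the statistical model and assumptions behind any estimate to be stated.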
Study B
In a study of morbidity and nutritional status in 1165 preschool children living in poor conditions in Delhi, India, data were obtained on nutrition and illness. Nutrition was described by a standard method as normal or as one of four levels of inadequate nutrition: I, II, III, and IV. For the purpose of analysis, the two most severely undernourished groups, III and IV, were combined. One part of the study examined four categories of illness during the past year: upper respiratory infection (URI), diarrhea, URI and diarrhea, and none. The data obtained, more information about the study context, and a reference to a publication describing the study are given in IPS7e Supplementary Exercise 9.4. The data are available in Minitab format and as a comma-separated file, for import into Stata and other statistical software.
One approach to exploring further a two-way table with a (significant) association is successive splitting; the textbooks by Christensen (1996/2013), (Unbalanced) Analysis of Variance, Design and Regression, refer to the method as "Lancaster-Irwin partitioning" and give a worked example (which is more advanced than needed for the present data). The method involves several steps, and in each step a two-way table is split into a reduced and a collapsed table, as detailed below. The splitting procedure stops when there is no association left in a table (in which case further splitting would give no new information) or when a table is a 2*2 table (in which case no further splitting is possible). Tables may be split in different ways, and the choice of a split should be guided both by the interpretation of the groups defining the table and by the actual patterns in the table. Ideally the split tables should have either a strongly significant or a completely non-significant association, but this is not always achievable.
A single split is carried out as follows. Suppose that we want to focus on a subtable defined by omitting one or more columns of an initial 3*4 table. The resulting reduced table (without the omitted column(s)) is the first part of the split. The second part of the split is a table containing the column(s) omitted and one extra column totalling the columns contained in the reduced table. For example, the 3*4 table may be split into a reduced 3*3 table and a collapsed 3*2 table by omitting one column. Alternatively, when omitting one of the rows, a reduced 2*4 table and a collapsed 2*4 table result. In any case, both of the newly constructed tables are then analysed by the usual two-way table procedures, and may be further split in the next steps. The key idea behind the splitting is that the Pearson chi-square statistic of the full table is approximately the sum of the Pearson chi-square statistics of the two split tables, and the degrees of freedom add up exactly (in the above examples, 6=4+2 or 6=3+3). Therefore, the procedure amounts to breaking the Pearson chi-square statistic up into components.
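A single column split as described above can be sketched in Python as follows. The function names and the 3*4 table of counts are hypothetical illustrations (the counts are not the Study B data), and the sketch assumes that all row and column totals are positive.

```python
def pearson_chi2(table):
    """Return (Pearson chi-square statistic, degrees of freedom)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df


def split_columns(table, omit):
    """Split a table by omitting the columns whose indices are in `omit`.

    reduced:   the table without the omitted columns.
    collapsed: the omitted columns plus one extra column holding the
               row totals of the columns kept in the reduced table.
    """
    keep = [j for j in range(len(table[0])) if j not in omit]
    reduced = [[row[j] for j in keep] for row in table]
    collapsed = [[row[j] for j in omit] + [sum(row[j] for j in keep)]
                 for row in table]
    return reduced, collapsed


# Hypothetical 3*4 table of counts (NOT the assignment data).
full = [[20, 15, 30, 35],
        [25, 20, 25, 30],
        [40, 25, 20, 15]]

reduced, collapsed = split_columns(full, omit=[3])  # 3*3 and 3*2 tables

x2_full, df_full = pearson_chi2(full)
x2_red, df_red = pearson_chi2(reduced)
x2_col, df_col = pearson_chi2(collapsed)

print(f"full:      X2 = {x2_full:.2f}, df = {df_full}")
print(f"reduced:   X2 = {x2_red:.2f}, df = {df_red}")
print(f"collapsed: X2 = {x2_col:.2f}, df = {df_col}")
print(f"sum of split statistics = {x2_red + x2_col:.2f}")
```

The degrees of freedom add up exactly (6 = 4 + 2), whereas the two split statistics sum only approximately to the statistic of the full table. A row split can be obtained with the same function by transposing the table first.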
Technical note: As the splitting is guided by the data and involves multiple tests, the P-values obtained from the usual chi-square distributions for the split tables are not valid in a strict sense. One may either consider the analysis as exploratory and give less weight to formal statistical significance, or apply the conservative rule of assessing every test statistic against the chi-square distribution with the degrees of freedom of the original (full) table.
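The conservative rule can be illustrated with a short Python sketch. The helper `chi2_sf` is a hypothetical name; it computes the upper tail probability of a chi-square distribution for integer degrees of freedom using a standard recurrence for the incomplete gamma function, so that no statistical library is assumed. The statistic of 10.0 is an arbitrary illustrative value, not a result from the assignment data.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-square variable.

    Assumes x >= 0 and integer df >= 1. Uses the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    with z = x / 2, started at Q(1, z) = exp(-z) for even df
    or Q(1/2, z) = erfc(sqrt(z)) for odd df.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Illustrative values only: a split table with 2 degrees of freedom,
# obtained from a full table with 6 degrees of freedom.
statistic = 10.0
df_split, df_full = 2, 6

p_nominal = chi2_sf(statistic, df_split)      # usual P-value of the split table
p_conservative = chi2_sf(statistic, df_full)  # conservative rule: full-table df

print(f"nominal P = {p_nominal:.4f}, conservative P = {p_conservative:.4f}")
```

Because the full table has at least as many degrees of freedom as any split table, the conservative P-value is never smaller than the nominal one.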
Practical hint: For analysis in Minitab, it is probably easier to type the numbers of the split tables into the worksheet and use the "Summarized data in a two-way table" format for the data, than to manipulate the datafile provided.