Assignment III for Biostats Course VHM 801 at AVC - Fall semester 2021

The assignment is worth 15% of the final course mark. Please be aware that by handing in the home assignment you implicitly acknowledge that you have read and accepted the instructions for home assignments as described on the VHM 801 homepage. Recall that all solutions should be accompanied by text explaining the procedures used; in particular, all statistical models and assumptions should be specified. The level of detail provided should at least conform to the main principles of the recommendations for statistical reporting, as discussed in the course. Because the text of the assignment is rather long, it is suggested that you do not include the text itself among your answers.

This home assignment involves analysis of two unrelated datasets. The questions pertaining to the first dataset comprise roughly 10 of the 15 points for the mark of the assignment.

Study A
The data for this study were collected by Charles Caraguel as part of his PhD project in the Department of Health Management, AVC. The interest is in assessing the test characteristics of Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR) tests for Infectious Salmon Anemia Virus (ISAV) run at different laboratories and involving different preparations of the samples. Here we consider kidney samples of 95 fish from salmon aquaculture sites in New Brunswick. The tissue samples from each fish were divided into several subsamples. One subsample was directly submitted to a laboratory for testing. Another subsample was preprocessed at AVC before it was submitted to the same lab for testing. The dataset described below therefore contains two subsamples derived from each fish. In addition to the results of the testing of the different subsamples, all the information obtained from each fish (including a lot of information not shown here) was combined into a single indicator of whether the fish was indeed infected with ISAV or not. The meaning of the variables in the files is explained below. Note that for all variables a disease-positive status/result is indicated by a "1", and a disease-negative status/result by a "0". The data are available as datafiles in both a Minitab format and a comma-separated format, for import into Stata and other statistical software.

The RT-PCR test run on samples with different preparations (preprocessed or not) may be considered as two distinct versions of the same diagnostic test. For simplicity of wording we will here refer to them as two distinct diagnostic tests. Recall that the test sensitivity is the chance (probability) of detecting a truly diseased subject by the diagnostic test, whereas the test specificity is the chance (probability) of getting a negative diagnostic test result for a truly non-diseased subject. Ideally, a diagnostic test should have high values of both sensitivity and specificity, but this may be difficult to achieve in practice. Thus it is important to know these characteristics; for example, if a test has a low specificity (say 0.80), it will give a fairly high rate of false positives (in the example, 20% of truly non-diseased subjects).
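
As an illustration of these definitions only, the following minimal Python sketch computes sensitivity, specificity and the false positive rate from a table of counts. The counts are hypothetical (they are not taken from the ISAV data), and the Wilson score interval is shown as just one of several possible interval methods; the helper name wilson_interval is an assumption made for this sketch.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts, NOT taken from the assignment data.
tp, fn = 45, 5     # truly diseased subjects: test positive / test negative
tn, fp = 80, 20    # truly non-diseased subjects: test negative / test positive

sensitivity = tp / (tp + fn)            # P(test positive | truly diseased)
specificity = tn / (tn + fp)            # P(test negative | truly non-diseased)
false_positive_rate = 1 - specificity   # share of non-diseased testing positive

sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"sensitivity = {sensitivity:.2f}, interval = ({sens_ci[0]:.3f}, {sens_ci[1]:.3f})")
print(f"specificity = {specificity:.2f}, interval = ({spec_ci[0]:.3f}, {spec_ci[1]:.3f})")
print(f"false positive rate = {false_positive_rate:.2f}")
```

Each proportion is computed within its own subpopulation (truly diseased or truly non-diseased), which is why the two denominators differ.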

  1. For disease status defined by the variable infected, estimate the sensitivity and specificity of each of the two diagnostic tests, and also give the corresponding interval estimates (make sure to describe your method of calculation). Do these calculations indicate that one of the two tests is superior to the other? In order to quantify a comparison between the two tests, one would like to carry out a significance test. Do you think it is valid to do a two-sample z-test to compare the sensitivity of the two tests? If so, carry out the test and draw conclusions; if not, explain why and indicate what type of model/approach should be used instead (without carrying out any calculations).

  2. The outcomes of two diagnostic tests may be independent or dependent (correlated). The most immediate way of assessing a dependence is through the table of overall counts of positive and negative outcomes by the two tests. Construct this table for the ISAV data, and use a significance test to assess whether there is such a dependence between the tests. Draw conclusions, and discuss what your finding actually tells us, in practical terms, about the two tests. For this discussion, it may be useful to think about what we could (should) conclude if there was no such dependence between two tests in a population containing both truly positive and negative subjects.

  3. A potentially more useful way of defining independence or dependence considers the performance of the two tests separately in the subpopulations of truly positive and negative subjects (this form of dependence is often called conditional (in)dependence). For the ISAV data, determine again the two subpopulations by the variable infected, and use suitable significance tests to assess whether the tests are conditionally independent in each of these two subpopulations. Draw conclusions, and discuss what your finding actually tells us, in practical terms, about the two tests. For this discussion, it may be useful to think about what we could (should) conclude if there was a strong conditional dependence between two tests.
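
The difference between the overall table of question 2 and the conditional tables of question 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed method: the records are hypothetical (not the ISAV data), the helper names cross_tab and fisher_exact_two_sided are invented for the sketch, and Fisher's exact test is used as just one possible significance test for a 2*2 table.

```python
from math import comb

def cross_tab(pairs):
    """2*2 table of counts: rows = result of test A (0/1), columns = test B (0/1)."""
    table = [[0, 0], [0, 0]]
    for a, b in pairs:
        table[a][b] += 1
    return table

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P-value: sums the probabilities of all tables
    with the observed margins that are no more probable than the observed one."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Hypothetical records (infected, test_a, test_b); NOT the assignment data.
records = (
    [(1, 1, 1)] * 30 + [(1, 1, 0)] * 3 + [(1, 0, 1)] * 3 + [(1, 0, 0)] * 4
    + [(0, 1, 1)] * 1 + [(0, 1, 0)] * 4 + [(0, 0, 1)] * 5 + [(0, 0, 0)] * 50
)

# Overall table: ignores the true disease status.
overall = cross_tab([(a, b) for _, a, b in records])

# Conditional tables: one table per true-status subpopulation.
by_status = {
    status: cross_tab([(a, b) for inf, a, b in records if inf == status])
    for status in (0, 1)
}

p_overall = fisher_exact_two_sided(overall)
p_by_status = {s: fisher_exact_two_sided(t) for s, t in by_status.items()}

print("overall table:", overall, "P =", p_overall)
for status, table in by_status.items():
    print("infected =", status, "table:", table, "P =", p_by_status[status])
```

In this made-up example the overall table shows a much stronger association than either conditional table, because two tests that both track the true disease status will agree with each other in a mixed population even when they are unrelated within each subpopulation.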

Study B
In a study of morbidity and nutritional status in 1165 preschool children living in poor conditions in Delhi, India, data were obtained on nutrition and illness. Nutrition was described by a standard method as normal or as one of four levels of inadequate: I, II, III, and IV. For the purpose of analysis, the two most severely undernourished groups, III and IV, were combined. One part of the study examined four categories of illness during the past year: upper respiratory infection (URI), diarrhea, URI and diarrhea, and none. The data obtained, more information about the study context, and a reference to a publication describing the study are given in IPS7e Supplementary Exercise 9.4. The data are available in Minitab format and as a comma-separated file, for import into Stata and other statistical software.

  1. Analyze the data to assess the association between nutritional status and type of illness. Include in your analysis a carefully motivated statistical model and hypothesis, as well as a corresponding statistical test and its conclusion, both in statistical and practical terms. Make sure to address as part of your analysis any conditions/assumptions your model and statistical procedures rely on. Postpone to the next question a detailed description of the relationship (if any) shown in the data between illness and nutrition status.

  2. The task in this question is to further study and describe any association in the data between type of illness and nutritional status. A useful tool for such a description in a two-way table is a method called successive splitting, described in detail below. It is recommended, but not mandatory, to use this method. If you decide not to use it, you will need to quantify the association in another way, by relevant estimates and detailed assessments of statistical significance for the patterns you describe. Summarize all your analyses into a single conclusion for the data.

    One approach to further exploring a two-way table with a (significant) association is successive splitting; the textbooks by Christensen (1996/2013), (Unbalanced) Analysis of Variance, Design and Regression, refer to the method as "Lancaster-Irwin partitioning" and give a worked example (which is more advanced than needed for the present data). The method involves several steps, and in each step a two-way table is split into a reduced and a collapsed table, as detailed below. The splitting procedure stops when there is no association left in a table (in which case further splitting would give no new information) or when a table is a 2*2 table (in which case no further splitting is possible). Tables may be split in different ways, and the choice of a split should be guided both by the interpretation of the groups defining the table and by the actual patterns of the table. Ideally the split tables should have either a strongly significant or a completely non-significant association, but this is not always achievable.

    A single split is carried out as follows. Suppose that we want to focus on a subtable defined by omitting one or more columns of an initial 3*4 table. The resulting reduced table (without the omitted column(s)) is the first part of the split. The second part of the split is a table containing the column(s) omitted and one extra column totalling the columns contained in the reduced table. For example, the 3*4 table may be split into a reduced 3*3 table and a collapsed 3*2 table by omitting one column. Alternatively, when omitting one of the rows, a reduced 2*4 table and a collapsed 2*4 table result. In any case, both of the newly constructed tables are then analysed by the usual two-way table procedures, and may be further split in the next steps. The key idea behind the splitting is that the Pearson chi-square statistic of the full table is approximately the sum of the Pearson chi-square statistics of the two split tables, and the degrees of freedom add up exactly (in the above examples, 6=4+2 or 6=3+3). Therefore, the procedure amounts to breaking the Pearson chi-square statistic up into components.
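
A single column split as described above could be sketched in Python as follows. The 3*4 table of counts is hypothetical (it is not the Study B data), and the helper names pearson_chi2 and split_by_columns are assumptions made for this illustration.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom of a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            expected = rt * ct / n
            stat += (obs - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df

def split_by_columns(table, omitted):
    """Split a table into the reduced table (omitted columns removed) and the
    collapsed table (omitted columns plus one column totalling the rest)."""
    kept = [j for j in range(len(table[0])) if j not in omitted]
    reduced = [[row[j] for j in kept] for row in table]
    collapsed = [[row[j] for j in omitted] + [sum(row[j] for j in kept)]
                 for row in table]
    return reduced, collapsed

# Hypothetical 3*4 table of counts; NOT the assignment data.
full = [
    [30, 20, 25, 25],
    [25, 22, 28, 25],
    [15, 25, 30, 50],
]

reduced, collapsed = split_by_columns(full, omitted=[3])  # omit the last column

stat_full, df_full = pearson_chi2(full)
stat_red, df_red = pearson_chi2(reduced)
stat_col, df_col = pearson_chi2(collapsed)

print("full:     ", round(stat_full, 2), "on", df_full, "df")
print("reduced:  ", round(stat_red, 2), "on", df_red, "df")
print("collapsed:", round(stat_col, 2), "on", df_col, "df")
print("sum of parts:", round(stat_red + stat_col, 2), "on", df_red + df_col, "df")
```

The printout allows a direct comparison of the full-table statistic with the sum of the two component statistics; the degrees of freedom add up exactly (6=4+2 for this split), whereas the statistics agree only approximately.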

    Technical note: As the splitting is guided by the data and involves multiple tests, the P-values obtained from the usual chi-square distributions for the split tables are not valid in a strict sense. One may either consider the analysis as exploratory and give less weight to formal statistical significance, or use the conservative rule to assess all test statistics in the chi-square distribution with the degrees of freedom of the original (full) table.
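
The effect of the conservative rule can be illustrated with a small Python sketch. The test statistic is hypothetical, and the helper chi2_sf_even_df is an assumption made for this illustration: it uses the closed-form upper tail probability of the chi-square distribution, which holds only for even degrees of freedom (statistical software provides the general case).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability P(X > x) of a chi-square distribution.
    Closed form valid ONLY for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))

# Hypothetical Pearson statistic of a collapsed 3*2 table obtained by
# splitting a 3*4 table; NOT computed from the assignment data.
stat = 9.5
df_split = 2   # degrees of freedom of the split table itself
df_full = 6    # degrees of freedom of the original (full) table

p_nominal = chi2_sf_even_df(stat, df_split)       # usual P-value for the split table
p_conservative = chi2_sf_even_df(stat, df_full)   # conservative rule: full-table df

print(f"nominal P-value (df={df_split}):      {p_nominal:.4f}")
print(f"conservative P-value (df={df_full}): {p_conservative:.4f}")
```

For the same statistic, the conservative P-value is always at least as large as the nominal one, so a split that looks clearly significant on its own degrees of freedom may no longer be so under the conservative rule.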

    Practical hint: For analysis in Minitab, it is probably easier to type the numbers of the split tables into the worksheet and use the "Summarized data in a two-way table" format for the data, than to manipulate the datafile provided.


Henrik Stryhn (hstryhn@upei.ca) 2021-11-04