BIOS601 AGENDA: Week of September 10 to September 17, 2021

[updated September 08, 2022]

Agenda for Week of September 10 to September 16, 2022

Discussion of issues in JH's Notes and assignment on C&H Ch01 [prob. Models] and Ch02 [conditional Prob. Model]

Answers to be uploaded to MyCourses individually by end of the 'business' week for:
Supplementary exercises 1.1 (p.4), 1.2 (p.5), 2.1 , 2.2[PhD], 2.3, 2.4[PhD], 2.5 (p. 9), 2.15, 2.18

Answers to Supplementary exercise 2.11 to be uploaded by 'corresponding authors' of 2 teams of non-PhD students (teams and authors to be decided)

Answers to Supplementary exercise 2.13 to be uploaded by corresponding author of 1 team of PhD students

Remarks:

Chapter 1 of C&H introduces some ways of looking at statistical entities and concepts that you may not have met, as well as some terminology that is used in a more specific way in epidemiology. You might want to look at section 1 of JH's notes, from earlier years, on Concepts involved in Occurrence Measures in Epidemiology. JH has also included the first page of this section (mostly definitions) in the notes that annotate the C and H chapters: he has placed it under the heading 'Important: Concepts and Terms in Epidemiology' after his notes on 1.2 Binary data, and before 1.3 The binary probability model.

Supplementary Exercise 1.1 is designed to get you familiar with the 'other' scales for measuring probabilities, and with when the odds and probability measures are close and when they diverge. Other scales you will need to become very familiar with are the logit and the probit scales. We show all of these in one graph in our 'under construction' online textbook for epidemiology students. The online book has newer versions of some of the graphs in JH's notes, as well as additional commentaries.
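
To get a feel for these scales before tackling the exercise, the following is a minimal Python sketch (not taken from JH's notes or the online book) that tabulates a few probabilities on the odds, logit and probit scales. The function names and the particular probabilities are illustrative choices only.

```python
from math import log
from statistics import NormalDist

def odds(p):
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log of the odds."""
    return log(odds(p))

def probit(p):
    """Standard Normal quantile corresponding to probability p."""
    return NormalDist().inv_cdf(p)

# Illustrative probabilities: compare the columns for small and large p.
for p in (0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"p={p:4.2f}  odds={odds(p):7.3f}  "
          f"logit={logit(p):7.3f}  probit={probit(p):7.3f}")
```

Scanning down the odds column against the probability column is one quick way to see where the two measures are close and where they diverge.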

JH's notes on Section 1.4 of C&H (and Supp. exercise 1.2) are intended to 'shake you up a bit' and force you to think outside the box as to how you used to estimate the parameters of a simple linear regression. This model is usually shown as a 2-parameter (slope, intercept) model, but JH has deliberately reduced the model to a 1-parameter version, with the "line" going through the origin [another example might be trying to estimate (from error-containing measurements of the volumes of 2 spheres of different radii: radii measured without error!) the constant in the relation: Volume of a sphere = "some constant" times the cube of its diameter]. The fewer the elements involved, the more chance there is to really master the fundamentals and 'join the dots.'
He has recently added a shiny app that allows additional criteria for the 'fit'.
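
As a rough illustration of what 'different criteria for the fit' can mean in a 1-parameter model, here is a minimal Python sketch. The data are made up, and least absolute deviations is used only as one plausible alternative to least squares; the notes above do not say which criteria the shiny app offers.

```python
def least_squares_slope(x, y):
    """Slope b minimizing sum((y_i - b*x_i)**2) for a line through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def least_absolute_slope(x, y):
    """Slope b minimizing sum(|y_i - b*x_i|): a weighted median of the
    ratios y_i/x_i, with weights |x_i| (assumes no x_i is zero)."""
    pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y))
    half = sum(weight for _, weight in pairs) / 2
    running = 0.0
    for ratio, weight in pairs:
        running += weight
        if running >= half:
            return ratio

# Hypothetical measurements, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print("least squares slope:          ", least_squares_slope(x, y))
print("least absolute deviations slope:", least_absolute_slope(x, y))
```

With only one parameter, each criterion reduces to a simple summary of the ratios y/x (a weighted mean or a weighted median), which makes it easier to see what each criterion is 'doing'.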

You can also try another 1-parameter (elevator) example at the bottom of that webpage.

Chapter 2 of C&H is -- to JH at least -- a very elegant and simple and graphic way to introduce probabilities, and particularly those that are linked to each other in time, or by additional pieces of knowledge. And notice how many probabilities of interest go from right to left, i.e., from after to before. It is worthwhile to work through C&H's own exercises and then check your answers against the solutions they provide at the end of Ch 2.

Fig 2 in JH's Notes on Ch 2 has several simple but educational examples showing the different 'directionalities'. It also emphasizes that products of probabilities are like 'fractions of fractions', but that sometimes the probabilities depend on what has gone before, and sometimes they do not. (The online book has newer diagrams.)
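
The 'fractions of fractions' idea can be checked with exact arithmetic. The sketch below uses a made-up urn (3 red and 2 blue balls), not one of the examples in Fig 2, to contrast a second-stage probability that depends on what has gone before with one that does not.

```python
from fractions import Fraction

# Hypothetical urn: 3 red and 2 blue balls; we want P(both draws are red).
p_first_red = Fraction(3, 5)

# With replacement: the second fraction does not depend on the first draw.
p_both_red_with_replacement = p_first_red * Fraction(3, 5)

# Without replacement: after a red ball is removed, 2 of the remaining 4 are red,
# so the second fraction depends on what has gone before.
p_both_red_without_replacement = p_first_red * Fraction(2, 4)

print(p_both_red_with_replacement)
print(p_both_red_without_replacement)
```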

The 2 stories accompanying the Notes on section 2.2 should serve as a stark and frightening reminder that P(theta|data) is a very different 'animal' than P(data|theta) and that the consequences of mixing them can be enormous.

If you want a topical example, think of the difference between P(A|B) and P(B|A), where A = the hypothesis that Higgs Boson particles exist, and B = the bump in the curve. Btw, JH likes to label the elements in what appears to be the best 'logical' or 'chronological' or 'causal' order, i.e., A -> B, but notices that many textbooks teach the concepts using arbitrary letters.

JH's notes on Section 2.3 have a genetics (haemophilia) example that is still very relevant. But, since he first encountered it 40 years ago, medical science has advanced, and so one does not now need to wait until the woman has one or more offspring before learning about her carrier status. JH would be grateful for a different example where one would still need to wait.

At a debate a few years ago, JH came up with the challenge of estimating/judging a person's age from various pieces of information. You might like to take a quick look at the example and the pieces of information provided.

Supplementary Exercise 2.1 ('Efron's twins story') can be tackled in many ways. Efron uses the odds scale to go from 'pre-' to 'post'-test odds, and then switches back to the probability scale. We do the same when teaching medical students about diagnostic tests. Fortunately, today, with readily accessible apps, there is less emphasis on the calculation, and more on the probabilities themselves. A few pages further in the notes, you will see what (paper) 'apps' were like in 1975! Fagan's nomogram is still a clever tool, and JH has used it as a starting point for a shiny app cited in the coloured box on the right hand side of page 8 of his Notes. This box gives you links to the 'terminology' for the errors/performance of medical diagnostic tests (if JH had his way, we would never have invented the terms sensitivity and specificity) and the correspondences with statistical tests.
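
The odds-scale route (pre-test odds times likelihood ratio gives post-test odds) can be sketched in a few lines of Python. The function name and the numbers below are hypothetical, chosen for a generic diagnostic test rather than for the twins story, so that the exercise itself is left to you.

```python
def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
    """Go from pre-test probability to post-test probability via the odds scale."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    if positive:
        # Likelihood ratio for a positive result.
        likelihood_ratio = sensitivity / (1 - specificity)
    else:
        # Likelihood ratio for a negative result.
        likelihood_ratio = (1 - sensitivity) / specificity
    post_test_odds = pre_test_odds * likelihood_ratio
    # Switch back from the odds scale to the probability scale.
    return post_test_odds / (1 + post_test_odds)

# Hypothetical test: 1% pre-test probability, sensitivity 90%, specificity 95%.
print(post_test_probability(0.01, 0.90, 0.95, positive=True))
print(post_test_probability(0.01, 0.90, 0.95, positive=False))
```

This is the same arithmetic that Fagan's nomogram carries out graphically, by lining up the pre-test probability and the likelihood ratio.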

Supplementary Exercise 2.2 ('The Monty Hall Problem') can be very frustrating and is easily misunderstood. JH has had to break up fights between people who are over-confident but under-listening. Key is the fact that Monty Hall KNOWS what is behind each door: sometimes (how often?) he has a choice of 2 doors that he could open to reveal nothing, and sometimes (how often?) he only has 1 choice.
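
If you want to check your reasoning empirically, a simulation along the following lines is one option. This is a minimal sketch under the standard assumptions (3 doors, 1 prize, Monty always opens an empty door that the contestant did not pick, and chooses at random when he has 2 such doors); the function names are made up. It reports both the winning frequency and how often Monty had 2 doors to choose from.

```python
import random

def play(rng, switch):
    """Play one game; return (won, number of doors Monty could have opened)."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty KNOWS where the prize is: he never opens the picked door or the prize door.
    options = [d for d in doors if d != pick and d != prize]
    opened = rng.choice(options)
    if switch:
        # Move to the one door that is neither the original pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize, len(options)

def simulate(n_games, switch, seed=0):
    """Return (proportion of games won, proportion of games where Monty had 2 options)."""
    rng = random.Random(seed)
    wins = 0
    had_two_options = 0
    for _ in range(n_games):
        won, n_options = play(rng, switch)
        wins += won
        had_two_options += (n_options == 2)
    return wins / n_games, had_two_options / n_games

for strategy in (False, True):
    win_rate, two_option_rate = simulate(100_000, switch=strategy)
    print(f"switch={strategy}: win rate={win_rate:.3f}, "
          f"Monty had 2 options in {two_option_rate:.3f} of games")
```

Try changing the rule for how Monty behaves (for example, letting him open a door without knowing where the prize is) to see why being precise about the information provided matters.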

In Exercise 2.3, it is equally important to be very precise as to the information provided.

In Exercise 2.4, we have another good example of the difference between P(H|data) and P(data|H). Notice here that we are not examining a range of possible H's, just 2 specific H's. Notice further that in the Bayesian approach we do not consider data values that have not been observed; in contrast, the p-value does consider data values that have not been observed (we should not call such unobserved values 'data', but rather potential data values).
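
The contrast can be made concrete with a small binomial sketch. The setup below (10 trials, 8 'successes', two specific hypotheses with success probabilities 0.5 and 0.8, and equal prior probabilities) is entirely hypothetical and is not the setup of Exercise 2.4. The posterior uses only the probability of the observed value under each H, whereas the p-value also sums over values that were not observed.

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for a Binomial(n, p) random variable."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Hypothetical setup, for illustration only.
n, y_obs = 10, 8
p_h0, p_h1 = 0.5, 0.8      # the 2 specific hypotheses
prior_h0 = 0.5             # assumed prior probability of H0

# Bayesian route: only the OBSERVED value enters, via P(data | H) for each H.
lik_h0 = binom_pmf(y_obs, n, p_h0)
lik_h1 = binom_pmf(y_obs, n, p_h1)
posterior_h0 = prior_h0 * lik_h0 / (prior_h0 * lik_h0 + (1 - prior_h0) * lik_h1)

# p-value route: also sums over potential values (9 and 10) that were NOT observed.
p_value = sum(binom_pmf(y, n, p_h0) for y in range(y_obs, n + 1))

print("P(H0 | data)                =", posterior_h0)
print("one-sided p-value under H0 =", p_value)
```

The two numbers answer different questions, so there is no reason for them to agree.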

JH finds that diagrams, especially 'tree' diagrams, can be very helpful in these types of problems, and again when we revisit the Binomial.

Q2.5 was new in 2015, so the wording hasn't had the same beta-testing as 2.1-2.4.

It's a pity that in the otherwise clever 'left brain' article, the BMJ messed up on the 'teaser' introduction. JH finds The Economist graphics clearer and simpler. What about you?

Q2.11 was new in 2019, having been prompted by a (since withdrawn) tutorial article 'How to investigate an accused serial sexual harasser' in Statistics in Medicine. If you Google it, you will see that it generated considerable 'heat'. The tutorial referred indirectly to the data given in the exercise. JH took a special interest in the topic because of his involvement in reviewing the 2003 report.

Q2.12 is new in 2020, and was prompted by the coverage of the Santa Clara study in Andrew Gelman's blog. The Santa Clara study was also the basis for exercise 22 in the measurement material, and a question in the Part A (bios700) PhD exam of August 4, 2020. We will come back to it again when we address Likelihood-only methods in C&H chapter 3.

Q2.13 is new in 2021, and was prompted by the increasing numbers of statements about who the patients being hospitalized for COVID-19 are.

It is also a great chance to learn a very common epidemiologic design, one that goes by a very poorly chosen name -- the so-called 'CASE-CONTROL' study. We explain what it involves, and why it is a simple and otherwise standard comparison of two rates, but where the relative (or maybe even the absolute) sizes of the person-time denominators are ESTIMATED (using a denominator series) rather than KNOWN.
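
The following minimal sketch shows the arithmetic behind that description. All counts are invented, and the function name is hypothetical; the point is only that the denominator series stands in for the (unknown) person-time in each exposure category, so that the ratio of two 'quasi-rates' estimates the rate ratio.

```python
def rate_ratio_from_denominator_series(cases_exposed, cases_unexposed,
                                       denom_exposed, denom_unexposed):
    """Estimate a rate ratio when person-time is not known but is represented
    by a denominator series (a sample of the base that generated the cases)."""
    # Each 'quasi-rate' is cases divided by the ESTIMATED relative size of person-time.
    quasi_rate_exposed = cases_exposed / denom_exposed
    quasi_rate_unexposed = cases_unexposed / denom_unexposed
    return quasi_rate_exposed / quasi_rate_unexposed

# Hypothetical counts: 40 exposed and 60 unexposed cases;
# in the denominator series, 20 exposed and 80 unexposed persons.
print(rate_ratio_from_denominator_series(40, 60, 20, 80))
```

Because only the relative sizes of the two denominators matter for the ratio, the sampling fraction of the denominator series cancels out; the absolute rates would need the absolute sizes as well.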