BIOS601: warm-up for bios601, 2018
[updated August 21, 2018]
- The objective is to use
Q1, Q2, Q3, Q5, Q7, Q8, Q9, and Q10
to
recall and apply statistical concepts and techniques you
have already encountered in your training up to now, and to get up to speed in R.
Some of these concepts/techniques are 'hidden' in (deliberately)
practical exercises, since JH is a firm believer that in graduate school,
the rhythm should not be 'technique -> example' but rather
'problem -> ? technique'.
Whatever answers to Qs 1, 2, 3, 5, 7, 8, 9, and 10 you are able to put together by
5 pm on Friday August 31 should be emailed to JH by then.
Any 'JH-readable' format (email text, pdf, word, scans or photos of handwritten material, etc) is acceptable,
as long as the files are not prohibitively large.
[ JH does not want you to work on them after that date.
Instead, he will be asking you to begin working on the
Qs on measurement -- a topic that is less familiar to you and
that he will address in the 1st class
(on Tuesday September 04). He expects to have the 2018 version of
the measurement exercises finalized soon,
and to set a deadline of Friday September 7 for the
Qs he will assign on the measurement topic. (If you are curious you can look at the
2017 version)
For the first several weeks (when the material is
mostly déjà vu), the rhythm will be: you begin working on
the assigned Qs the weekend BEFORE, we discuss the Qs
in class on Tuesday and Thursday,
and you email me your final answers by Friday.]
The first (general) computing issue is (if need be) to get up to speed
in the use of R. See the R links on the main course page.
If you run into problems, let JH know ASAP.
Remarks on specific questions:
Q1
Use the (.csv) dataset of 200 that JH has assembled for you.
Q1 (ii), Q2 (ii)
Some of the conceptual and practical statistical issues raised by this assignment include the
distinction between standard deviation and
standard error; the concept of a margin of error;
when it is appropriate to use the Normal (Gaussian) approximation to the binomial distribution;
the (often under-appreciated) centrality of the
Central Limit Theorem (CLT) in
applied statistical work, not just for the sampling distribution of a
sample proportion, but also for that of a
sample mean.
Another point the exercise tries to make is that most of sampling theory involves
the calculation of variances. These derivations bring you back to fundamentals, and
it's good to be able to work them out from scratch rather than consult a textbook or the internet
and copy the formulae blindly, without an understanding of what the
formula should look like.
You will notice that we have already started calling the sqrt of the variance of a STATISTIC
the STANDARD ERROR of that statistic. One key point about a standard error is that it refers
to the variability of a STATISTIC, not of individual observations. Some writers reserve the
term for cases where a plug-in estimate is used in the variance formula. For example, they would call
sigma/sqrt(n) the standard deviation of the sample mean, but they would call
s/sqrt(n) the estimated standard deviation, or the standard error for short. Bear in mind that this
terminology is not standardized across the profession.
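These distinctions can be checked numerically. The following is a minimal simulation sketch (not part of the assignment data; the sample size and parent distribution are arbitrary choices) contrasting the SD of individual observations with the SE of the sample mean, and showing the CLT at work even for a skewed parent distribution:

```r
# Sketch: SD of observations vs SE of the sample mean, plus the CLT.
set.seed(601)                      # arbitrary seed, for reproducibility

n      <- 50                       # sample size (illustrative choice)
n.sims <- 10000                    # number of simulated samples

# skewed parent: exponential with mean 1 (so the true SD of one obs. is 1)
means <- replicate(n.sims, mean(rexp(n, rate = 1)))

sd.individual <- 1                 # theoretical SD of one observation
se.theory     <- 1 / sqrt(n)       # sigma / sqrt(n)
se.empirical  <- sd(means)         # SD of the STATISTIC = its standard error

c(sd.individual, se.theory, se.empirical)

# the sampling distribution of the mean is near-Gaussian despite the skew
hist(means, breaks = 50, main = "Sampling distribution of the mean")
```

The histogram makes the CLT point directly: the parent is badly skewed, yet the distribution of the 10,000 sample means is already close to Gaussian at n = 50.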
Q2 Again, use the already-assembled dataset.
Q3 part ii. It would be good to use the d- (density/mass) function for this r.v.
to
make a graph of the probability mass function for the r.v. n'.
Doing so will make it easier to see where the 2 tails begin (i.e., where the 10th and 90th percentiles are).
The graph will also justify the remark about the 'near-Gaussian-ness' of this
distribution.
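The d-/q- pattern looks like the sketch below. Since the exact distribution of n' depends on the setup of Q3, a Binomial(100, 0.3) is used here purely as a stand-in:

```r
# Sketch of the pattern: plot a p.m.f. with the d- function, then mark
# where the tails begin with the q- (quantile) function.
n.vals <- 0:100
pmf    <- dbinom(n.vals, size = 100, prob = 0.3)   # d- function: the p.m.f.

plot(n.vals, pmf, type = "h",
     xlab = "n'", ylab = "probability", main = "p.m.f. of n'")

# where the two tails begin: the 10th and 90th percentiles
q10 <- qbinom(0.10, size = 100, prob = 0.3)
q90 <- qbinom(0.90, size = 100, prob = 0.3)
abline(v = c(q10, q90), lty = 2)
c(q10, q90)
```

The plot also shows the 'near-Gaussian-ness' at a glance: the spikes trace out a roughly bell-shaped curve.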
Q4, 2016
One of the points of asking you to think about various sampling options to get at the
'census' or 100% answer is to get you to think of sampling as a measurement tool.
Usually, the fancier and costlier the measuring instrument, the better the measurement.
But we can't always afford the million-dollar answer and have to live with the hundred dollar answer.
I once encountered a high-up Air Canada executive who didn't like the fact that the Canadian long-form
portion of the census involved only a 10% sample, and he wasn't sampled. So he did not trust the
results, since not everyone was surveyed. I asked him whether, when a doctor drew a blood sample from
him, he should give 100% of his blood so as to get an accurate concentration value.
The Harper government did away with the 1-in-10 random sample that was a compulsory
component of the Canadian census,
and replaced it with a voluntary 1-in-3 sample, so that despite the large refusal rate the bigger sample size would
still leave about the same n. What do you think of estimates based on this scheme, which has had only about a 65%
participation rate?
The participation rate in the old 1-in-10 sample was 99.999% and (despite Harper's
claims to the contrary) only a handful complained or refused.
One of the first acts of the Trudeau government was to re-instate the mandatory long-form census.
A lot of psychological or psychometric measurement also involves sampling -- of items
to use in a time-limited questionnaire or exam that
can only contain a sample of the items that might be asked about
(think of the format of the old paper-based GRE exams described in the Measurement Statistics Notes). The measurement model
observed estimate = TRUTH + error
is the same whether the error comes from sampling of items, or of persons,
or of time. In one case we might call it measurement error, and in another sampling error.
But it could include both!
Is it worth extracting and
entering all of the step counts in all of their digits to have an answer that has too many decimal places for what one wants?
And in so doing, we overlook other SEs -- i.e., statistical errors that do not decrease in magnitude
as we increase n (and thus sqrt(n))! You can probably think of some in the case of the StepCounter!
JH likes to say that besides standard error, the abbreviation SE could stand for many other types of error.
It could be SAMPLING error, or STATISTICAL error, or STUPID error. Sadly,
statistical theory is only good at quantifying sampling error, where the sqrt(n) is always
in the denominator of the formula. BUT IT DOES NOT KNOW HOW TO JUDGE OTHER TYPES OF ERROR,
and SOMETIMES THESE CAN BE A LOT LARGER THAN THE SAMPLING ERROR, AND THESE NON-SAMPLING ERRORS
CANNOT BE MADE SMALLER BY INCREASING n.
(Note added 2018):
Do you notice anything 'special' or
'different' about the 2016 data? It should not take any fancy
statistical tests ... just the 'intra-ocular traumatic test'.
INCREASING n will just make the answer MORE PRECISELY WRONG.
Q5
This returns to the question posed about Q1 at the very beginning,
namely how to draw a sample from a non-uniform distribution.
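Two standard ways to do this in R are shown in the sketch below (the particular distributions chosen are arbitrary examples, not the ones in Q5):

```r
# Sketch: drawing from a non-uniform distribution in R.
set.seed(601)

# (a) discrete case: sample() with a prob= vector of weights
x     <- 1:4
probs <- c(0.1, 0.2, 0.3, 0.4)
draws <- sample(x, size = 10000, replace = TRUE, prob = probs)
table(draws) / 10000            # empirical proportions, close to probs

# (b) continuous case: the inverse-CDF (probability integral) transform --
# feed Uniform(0,1) draws through the quantile (q-) function
u <- runif(10000)
y <- qexp(u, rate = 2)          # draws from an Exponential(rate = 2)
mean(y)                         # close to the theoretical mean of 1/2
```

The inverse-CDF idea in (b) is fully general: any r.v. with a computable quantile function can be simulated from uniform random numbers this way.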
Q6, 2016
Like the others, Q6 is a blend of the theoretical and the practical. Here you are asked to use
R to read in the .csv file, and to produce some summary statistics, and calculate some standard errors.
You probably have not worked out the SE [sqrt(Variance)] for a ratio, but this is a good example
of something often used in applied work. Hint: the log of a ratio is the difference of the logs of its components; the
approx. variance of a log of a positive rv is (by the Delta method) the original variance times the
square of the Jacobian or scaling factor, evaluated at/near the centre of the old scale.
Think of the variance of September temperatures in F as the variance of September temperatures in C,
multiplied by the square of the scaling factor: a change of 1 degree C is a change of 9/5 degrees F.
If the scaling is not linear (eg an elastic band) use the scale factor at the centre.
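A quick numerical check of the Delta-method claim, as a sketch (the mean and SD values are arbitrary; the approximation works well when sigma/mu is small):

```r
# Delta method: for a positive r.v. X centred at mu,
# Var(log X) ~= Var(X) * (1/mu)^2, since the scaling factor (Jacobian)
# of log at the centre is d log(x)/dx evaluated at mu, i.e. 1/mu.
set.seed(601)

mu    <- 20
sigma <- 2                       # sigma/mu is small, so the approx. is good
x     <- rnorm(100000, mean = mu, sd = sigma)

var(log(x))                      # empirical variance of log X
sigma^2 * (1 / mu)^2             # Delta-method approximation

# the temperature analogy is the LINEAR special case, where the scale
# factor 9/5 is constant: Var(F) = Var(C) * (9/5)^2 exactly
C    <- rnorm(100000, mean = 15, sd = 3)
Fdeg <- (9/5) * C + 32
c(var(Fdeg), var(C) * (9/5)^2)
```

The linear case needs no 'evaluated at the centre' caveat; it is only for non-linear transformations like log (the 'elastic band') that the local scale factor matters.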
Reference is made to the Finite Population Correction (FPC) for the sampling variance. This would apply in
cases where you sample n (< N) of the N members of the Canadian or U.S. Senate, or some of
the 40 pages of gasoline purchases.
It is written in slightly different versions by different people, but JH tends to think of
its approximate value as (1 - n/N). To get the proper variance (assuming in our case that the target is
just these 2 years, nothing else), the variance computed under an infinite population or
sampling with replacement assumption needs to be multiplied by this less-than-unity factor,
so that in the limit, if we sampled all N units (n = N), the FPC = 1 - n/N = 0, and the variance
for our (census) estimate is 0.
The form of the FPC can be derived for the binary response case using
the ratio of the hypergeometric to Binomial variance: the binomial is for samples from an infinite
population -- or a finite one but sampling with replacement-- whereas the hypergeometric is for samples
from a finite one, but without replacement.
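That ratio can be written down directly. A sketch, using the standard variance formulas for the two distributions (the N, K, n values below are arbitrary illustrations; the '40 pages' echoes the gasoline-purchase example):

```r
# FPC for a binary response: the ratio of the hypergeometric variance
# (n drawn WITHOUT replacement from N) to the binomial variance
# (with replacement) is (N - n)/(N - 1) ~= 1 - n/N.
N <- 40                          # population size, e.g. 40 pages of purchases
K <- 16                          # 'successes' in the population (arbitrary)
n <- 10                          # sample size
p <- K / N

var.binom <- n * p * (1 - p)                      # with replacement
var.hyper <- n * p * (1 - p) * (N - n) / (N - 1)  # without replacement

c(exact  = var.hyper / var.binom,   # (N - n)/(N - 1)
  approx = 1 - n / N)               # JH's approximate version

# and if n = N (a census), the FPC -- hence the variance -- is 0
(N - N) / (N - 1)
```

For these numbers the exact factor is 30/39 = 0.769 versus the approximate 1 - 10/40 = 0.75; the two agree closely once N is large relative to 1.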
Your other choice of method to sample the days from the scanned logs can be based on what
factors you think most affect activity, and can be used in the sampling design. There is no one
best one a priori (indeed the computerized data from 2010-2011 could be used to test out various
designs/estimators, but this is not required for the exercise, which is designed just to get
you thinking about the issues).
When using R (or another random number generator or random number table) keep track
of how exactly you started the sequence (see set.seed in R).
One of the savings to think about when entering data is discarding digits... what would be the effect
if you only entered thousands of steps (rounded or truncated to an integer number of thousands), or
hundreds of steps? Can you anticipate how much 'damage' is done by such approximations?
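A quick simulation sketch of that 'damage', using synthetic daily counts rather than the real log (the mean and SD of the fake counts are arbitrary):

```r
# Effect of rounding step counts to the nearest thousand before averaging.
set.seed(601)

steps   <- round(rnorm(365, mean = 9000, sd = 3000))   # fake daily counts
rounded <- round(steps / 1000) * 1000                  # keep thousands only

c(mean(steps), mean(rounded))      # the two means barely differ ...
sd(steps) / sqrt(365)              # ... relative to the SE of the mean

# rounding to the nearest 1000 behaves like adding a Uniform(-500, 500)
# error to each day, i.e. an extra variance of at most 1000^2 / 12
sqrt(1000^2 / 12)                  # SD of the rounding error: ~289 steps
```

Since 289 is small next to the day-to-day SD of several thousand steps, the rounding barely inflates the SE of the mean, while saving a lot of data entry.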
Q7
This Q was new last year, so the wording might still need some
polishing -- email JH if anything is unclear, or if you find yourself
spending a long time trying to figure out what is being asked -- it may
well be that the issue is with the wording rather than you!
JH will amalgamate the various samples so as to make a 100% sample,
or 'census' where the
only 'SE' is a statistical error (or recording or data-entry
or data-processing error).
In a census, the traditional SE (STANDARD Error) should be 0, but as we saw last
week with the Statistics Canada report on languages spoken in Quebec,
there can sometimes be some other and very embarrassing types of SE's.
Q8
This Q is new this year, and was prompted by a talk by
Nicholas Horton at SSC this summer. Nick has been complaining
about the not-always-very-imaginative way math-stat continues
to be taught, sometimes just as in the 1950s! Nick
has a very inspiring article on how the teaching of math-stat
can be modernized, without losing the mathematical rigour.
And, instead of just complaining, he is doing something about it.
JH has often complained about the many exercises in (the otherwise
very good) textbook by Casella
and Berger, where the sole task seems to be to integrate
a pdf over some 2-d region, without any motivation, or care for
how such an issue could arise in practice.
We will encounter a practical example of integrating over such
a region when we come to the 'getting from the Peel to the Vendome
Metro stations' in a few weeks.
Some additional links on Nick's work are
http://magazine.amstat.org/blog/2017/09/01/nick_horton/
and
Teaching the Next Generation of Statistics Students to 'Think With Data': Special Issue on Statistics and the Undergraduate Curriculum
https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2015.1094283#.W3wS5C0ZPqU
Q9
This Q is new this year, but it is an old problem.
JH heard of it from McGill Math-Stat prof David Wolfson.
He would give the 'for real' assignment of actually
tossing the coin 100 times, and then amaze the students
by identifying which of them really did, and which ones
merely 'made up' the sequence.
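One common basis for the trick (an assumption on JH's notes' part, since the source does not spell out Prof. Wolfson's method) is the length of the longest run: people who invent a sequence tend to avoid long runs, whereas in 100 genuine tosses the longest run of heads or tails is very likely to reach 5 or more. A sketch:

```r
# Distribution of the longest run in 100 fair coin tosses, by simulation.
set.seed(601)

# rle() gives the run-length encoding of a sequence; take the longest run
longest.run <- function(tosses) max(rle(tosses)$lengths)

sims <- replicate(10000,
                  longest.run(sample(c("H", "T"), 100, replace = TRUE)))

mean(sims >= 5)                      # P(longest run is at least 5)
quantile(sims, c(0.05, 0.5, 0.95))   # typical range of the longest run
```

A submitted sequence whose longest run is only 3 or 4 is therefore a strong hint that it was made up rather than tossed.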
Q10
This is probably the most important of all of the Qs !!