STAT5602

Two-Way Contingency Tables

Joint, Marginal and Conditional Distributions
Consider two categorical response variables X and Y, with X having I levels and Y having J levels, and suppose we classify each item in a population using both of these variables.
Responses (X, Y) corresponding to a randomly chosen item from this population have a joint probability distribution. Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that X assumes its ith level and Y assumes its jth level.
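Consider the following I x J contingency table of joint probabilities:

              Y = 1        Y = 2       ...     Y = J        Total
   X = 1    $\pi_{11}$   $\pi_{12}$    ...   $\pi_{1J}$   $\pi_{1+}$
   X = 2    $\pi_{21}$   $\pi_{22}$    ...   $\pi_{2J}$   $\pi_{2+}$
   ...
   X = I    $\pi_{I1}$   $\pi_{I2}$    ...   $\pi_{IJ}$   $\pi_{I+}$
   Total    $\pi_{+1}$   $\pi_{+2}$    ...   $\pi_{+J}$       1

The set $\{\pi_{ij}\}$ is the joint distribution of X and Y; it defines the (bivariate) relationship between the two variables.
The marginal distributions of X and Y are the row and column totals (respectively), obtained by summing the joint probabilities. These are denoted by $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$.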
The marginal distributions represent single-variable information; they do not refer to association links between the two variables.
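As a small illustration of how the marginal distributions arise from the joint distribution, here is a minimal Python sketch. The 2 x 3 table of joint probabilities and the function name `marginals` are hypothetical, chosen only for illustration.

```python
# Hypothetical 2 x 3 joint distribution {pi_ij}; the values are illustrative only.
joint = [
    [0.10, 0.20, 0.10],
    [0.15, 0.25, 0.20],
]

def marginals(joint):
    """Return (row totals pi_{i+}, column totals pi_{+j}) of a joint table."""
    row = [sum(r) for r in joint]          # sum over j for each row i
    col = [sum(c) for c in zip(*joint)]    # sum over i for each column j
    return row, col

row_marg, col_marg = marginals(joint)
print("pi_{i+}:", row_marg)
print("pi_{+j}:", col_marg)
```

Each marginal distribution sums to 1, and neither one says anything about how X and Y are associated: many different joint tables share the same margins.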
Usually the joint probabilities $\{\pi_{ij}\}$ are unknown, but they can be estimated by sampling.

Example: Consider a sample of 3566 individuals cross-classified by smoking status and sleep problems. This yields a 2 x 2 contingency table.
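Note: The overall sample size is fixed but the row and column totals are not. Thus this study corresponds to a multinomial sample with 4 outcomes. The maximum likelihood estimates (M.L.E.s) of the $\pi_{ij}$ are the sample proportions $p_{ij} = n_{ij}/n$, where $n_{ij}$ is the observed count in cell (i, j) and n is the overall sample size.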
Sometimes one variable can be thought of as a response variable and the other as an explanatory variable. (In this study, we might treat sleep problems as a response variable and smoking status as an explanatory variable.) For such cases, it may be useful to construct a separate probability distribution for Y at each level of X. Given that an item is classified in row i of X, let $\pi_{j|i}$ denote the probability of classification in column j of Y. The probabilities $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$ represent the conditional distribution of Y at the ith level of X. The conditional distribution of Y given X is related to the joint distribution of X and Y by $\pi_{j|i} = \pi_{ij}/\pi_{i+}$ for all i and j.
Usually, these conditional probability distributions are also unknown but they can be estimated by sampling.
For our example, we estimate the conditional probability distribution for sleep problems at the ith level of the explanatory variable (smoking status) using $p_{j|i} = n_{ij}/n_{i+}$, where $n_{i+}$ is the total count in row i.
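The estimates above are simple ratios of counts. The following Python sketch shows the sample joint proportions $n_{ij}/n$ and the sample conditional proportions $n_{ij}/n_{i+}$ side by side. The counts are made up for illustration (they are not the smoking and sleep data), and the function names are hypothetical.

```python
# Hypothetical counts n_ij (rows: levels of X, columns: levels of Y).
# These are made-up numbers, not data from the notes.
counts = [
    [30, 70],
    [20, 180],
]

def joint_estimates(counts):
    """Sample joint proportions p_ij = n_ij / n."""
    n = sum(sum(row) for row in counts)
    return [[n_ij / n for n_ij in row] for row in counts]

def conditional_estimates(counts):
    """Sample conditional proportions p_{j|i} = n_ij / n_{i+}."""
    return [[n_ij / sum(row) for n_ij in row] for row in counts]

p_joint = joint_estimates(counts)
p_cond = conditional_estimates(counts)
print(p_joint)
print(p_cond)
```

Dividing each estimated joint proportion by its row total reproduces the conditional estimate, mirroring the relationship between the conditional and joint distributions.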
Independence

When both variables are response variables, we can describe their association using:
- their joint distribution,
- the conditional distribution of Y given X, or
- the conditional distribution of X given Y.
The variables X and Y are statistically independent if $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i and j.
Thus, for X and Y independent, $\pi_{j|i} = \pi_{ij}/\pi_{i+} = \pi_{+j}$ for all i and j; that is, each conditional distribution of Y is identical to the marginal distribution of Y.
When Y is a response and X is an explanatory variable, the condition $\pi_{j|1} = \pi_{j|2} = \cdots = \pi_{j|I}$ for all j is a more natural definition of independence.

Note: In some tables where Y is a response variable and X is an explanatory variable, X is fixed rather than random. Then the notion of a joint distribution for X and Y is no longer meaningful. However, for a fixed level of X, Y has a probability distribution. Thus we can consider the conditional distribution of Y for different fixed levels of X.

Test for Homogeneity: Prospective Study

Example: The Physicians’ Health Study was a 5-year study testing whether regular intake of aspirin reduces mortality from cardiovascular disease. In this study, 22,071 physicians were randomly assigned either to a group that was to take one aspirin tablet every other day or to a group that was to take a placebo every other day. Of the 22,071 physicians, 11,034 were assigned to receive the placebo and 11,037 were assigned to receive aspirin. The study was blind, i.e. the physicians did not know which type of pill they were assigned to take. Of the 11,034 physicians taking the placebo, 189 suffered myocardial infarction (MI) over the course of the study (18 of which were fatal), while of the 11,037 taking aspirin, 104 suffered MI (5 of which were fatal). The results are summarized in the following 2 x 2 contingency table:

                 MI      No MI     Total
   Placebo      189      10845     11034
   Aspirin      104      10933     11037
   Total        293      21778     22071
Source: Preliminary report: Findings from the aspirin component of the ongoing Physicians’ Health Study, N. Engl. J. Med. 318:
Is the proportion of physicians taking a placebo who suffer MI the same as the proportion of physicians taking aspirin who suffer MI?
This is an example of a prospective study. (Note: In a prospective study, the row totals are fixed.) (NOTE: This study is a clinical trial, since physicians are assigned to the placebo and aspirin groups by the investigators. Another type of prospective study is a cohort study, where the researchers do not assign individuals to groups. e.g. to study the effect of smoking on MI, a researcher might select a sample of smokers independently of a sample of nonsmokers, but the researcher does not assign individuals to the smoking and nonsmoking groups.)
Let
$\pi_{1|1}$ = probability of suffering MI, given that the physician takes the placebo
$\pi_{2|1}$ = probability of not suffering MI, given that the physician takes the placebo
$\pi_{1|2}$ = probability of suffering MI, given that the physician takes aspirin
$\pi_{2|2}$ = probability of not suffering MI, given that the physician takes aspirin
We test $H_0: \pi_{1|1} = \pi_{1|2}$. To carry out the test, we need the MLEs under $H_0$ of the $\pi_{j|i}$. This allows us to determine estimated expected frequencies $\hat{m}_{ij}$. Pearson’s chi-square test statistic can then be used here.
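Under $H_0$, let $\pi$ denote the common value of $\pi_{1|1}$ and $\pi_{1|2}$. The two rows are independent binomial samples, so the kernel of the likelihood is $\pi^{n_{11}+n_{21}}(1-\pi)^{n_{12}+n_{22}}$.
The log likelihood of the kernel is $(n_{11}+n_{21})\log\pi + (n_{12}+n_{22})\log(1-\pi)$.
Thus, under $H_0$, $\hat{\pi} = (n_{11}+n_{21})/n = n_{+1}/n$.
Using this, the estimated expected frequencies are $\hat{m}_{i1} = n_{i+}\hat{\pi}$ and $\hat{m}_{i2} = n_{i+}(1-\hat{\pi})$, that is, $\hat{m}_{ij} = n_{i+}n_{+j}/n$.
We obtain Pearson’s $X^2 = \sum_i \sum_j (n_{ij}-\hat{m}_{ij})^2/\hat{m}_{ij} = 25.01$. Recall that for large samples, $X^2$ is approximately chi-square distributed with (I - 1)(J - 1) = 1 degree of freedom. The p-value is approximately 0, so there is strong evidence against $H_0$.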
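The Pearson calculation can be checked numerically. The Python sketch below uses the counts from the Physicians’ Health Study table (189 and 10845 for placebo, 104 and 10933 for aspirin) and takes the estimated expected frequency in each cell to be (row total)(column total)/n. The function names are illustrative, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

# Observed counts (rows: placebo, aspirin; columns: MI, no MI).
observed = [[189, 10845], [104, 10933]]

def expected_counts(observed):
    """Estimated expected frequencies m_ij = n_{i+} * n_{+j} / n under H0."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(observed):
    """Pearson's X^2 = sum over cells of (n_ij - m_ij)^2 / m_ij."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

x2 = pearson_x2(observed)
# Upper-tail probability for 1 degree of freedom:
# P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
p_value = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 2), p_value)
```

The tail formula used here is specific to 1 degree of freedom; a larger table would need a general chi-square distribution function.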
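A likelihood ratio chi-square test can also be used here. First we maximize the likelihood under $H_0$; then we maximize the likelihood without the restriction imposed by $H_0$ (the general case). We define $\Lambda$ as the ratio of these two maximized likelihoods, and the test statistic is $G^2 = -2\log\Lambda$.
For the test for homogeneity above, the kernel is $\pi_{1|1}^{n_{11}}(1-\pi_{1|1})^{n_{12}}\,\pi_{1|2}^{n_{21}}(1-\pi_{1|2})^{n_{22}}$.
Recall that, when $H_0$ is assumed to be true, the kernel simply becomes $\pi^{n_{11}+n_{21}}(1-\pi)^{n_{12}+n_{22}}$, where $\pi$ is the common value of $\pi_{1|1}$ and $\pi_{1|2}$, and the log likelihood of this kernel is maximized at $\hat{\pi} = n_{+1}/n$.
Consider now the kernel in the general context. Its log likelihood is maximized at $\hat{\pi}_{1|i} = p_{1|i} = n_{i1}/n_{i+}$, i = 1, 2.
For our example above, $G^2$ (called Wilks’ statistic) is $G^2 = 2\sum_i \sum_j n_{ij}\log(n_{ij}/\hat{m}_{ij})$ and $\hat{m}_{ij} = n_{i+}n_{+j}/n$. For large samples, $G^2$ is approximately chi-square distributed with 1 degree of freedom (same as for Pearson’s chi-square test). Here $G^2 = 25.37$, with p-value of approximately 0, again concluding that there is strong evidence against $H_0$.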
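The likelihood ratio statistic $G^2 = 2\sum n_{ij}\log(n_{ij}/\hat{m}_{ij})$, with $\hat{m}_{ij}$ equal to (row total)(column total)/n, can be evaluated in the same way. This is a minimal Python sketch for the aspirin table; the function name is illustrative.

```python
import math

# Observed counts (rows: placebo, aspirin; columns: MI, no MI).
observed = [[189, 10845], [104, 10933]]

def likelihood_ratio_g2(observed):
    """G^2 = 2 * sum of n_ij * log(n_ij / m_ij), with m_ij = n_{i+} n_{+j} / n."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            m_ij = row_totals[i] * col_totals[j] / n
            if n_ij > 0:  # an empty cell contributes 0 to the sum
                total += n_ij * math.log(n_ij / m_ij)
    return 2 * total

g2 = likelihood_ratio_g2(observed)
# Upper-tail probability for 1 degree of freedom (square of a standard normal).
p_value = math.erfc(math.sqrt(g2 / 2))
print(round(g2, 2), p_value)
```

For this table $G^2$ and Pearson’s $X^2$ are close to each other, as expected with large cell counts.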
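Now we try to understand the nature of the difference between the proportion of physicians taking aspirin who suffer MI and the proportion of physicians taking a placebo who suffer MI. We do this by examining confidence intervals, relative risk, and odds ratios.

Large Sample Confidence Interval for $\pi_{1|1} - \pi_{1|2}$

We showed that the MLEs of $\pi_{1|1}$ and $\pi_{1|2}$ were $p_{1|1} = n_{11}/n_{1+}$ and $p_{1|2} = n_{21}/n_{2+}$, where $n_{1+}$ and $n_{2+}$ are fixed. Also, $n_{11}$ and $n_{21}$ are independent binomial random variables with means and variances
$E(n_{i1}) = n_{i+}\pi_{1|i}$, $\mathrm{Var}(n_{i1}) = n_{i+}\pi_{1|i}(1-\pi_{1|i})$, i = 1, 2.
Consequently $p_{1|1}$ and $p_{1|2}$ are independent with means and variances
$E(p_{1|i}) = \pi_{1|i}$, $\mathrm{Var}(p_{1|i}) = \pi_{1|i}(1-\pi_{1|i})/n_{i+}$, i = 1, 2.
For large samples, we can use the fact that $p_{1|1}$ and $p_{1|2}$ will be approximately normally distributed. Thus a $100(1-\alpha)\%$ confidence interval for $\pi_{1|1} - \pi_{1|2}$ is
$(p_{1|1} - p_{1|2}) \pm z_{\alpha/2}\sqrt{p_{1|1}(1-p_{1|1})/n_{1+} + p_{1|2}(1-p_{1|2})/n_{2+}}$.
For our example, to obtain a 95% confidence interval for $\pi_{1|1} - \pi_{1|2}$, we compute $p_{1|1} = 189/11034 = 0.0171$ and $p_{1|2} = 104/11037 = 0.0094$, with estimated standard error 0.0015,
and thus a 95% confidence interval for $\pi_{1|1} - \pi_{1|2}$ is $0.0077 \pm 1.96(0.0015)$, or approximately (0.005, 0.011).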
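The large-sample interval for a difference of two independent proportions, estimate ± z × (estimated standard error), can be reproduced with a short Python sketch. The function name and argument names are illustrative; the counts are those of the aspirin table, with the placebo group first.

```python
import math
from statistics import NormalDist

def diff_of_proportions_ci(n11, n1, n21, n2, level=0.95):
    """Large-sample CI for the difference of two independent binomial proportions."""
    p1, p2 = n11 / n1, n21 / n2
    # Variances add because the two samples are independent.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Placebo: 189 MI out of 11034; aspirin: 104 MI out of 11037.
diff, se, (lower, upper) = diff_of_proportions_ci(189, 11034, 104, 11037)
print(round(diff, 4), round(se, 4), round(lower, 3), round(upper, 3))
```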
Since the interval does not contain 0, aspirin appears to diminish the risk of MI.

Relative Risk
A difference between two proportions may have greater importance when both proportions are near 0 or 1 than when they are near the middle. So, instead of studying the effect of aspirin on MI by considering the difference $\pi_{1|1} - \pi_{1|2}$, we could look at the relative risk, $\pi_{1|1}/\pi_{1|2}$, which is the ratio of the “success” probabilities for the 2 groups. In this case, “success” represents having MI.
The relative risk equals 1 when $\pi_{1|1} = \pi_{1|2}$ (i.e. the response is not affected by the group). We use the sample relative risk $p_{1|1}/p_{1|2}$ to estimate the population relative risk. For our data, we obtain $p_{1|1}/p_{1|2} = 0.0171/0.0094 = 1.82$. This implies that the sample proportion of MI cases was 82% higher for the group taking the placebo. Note that a relative risk of 1.0 corresponds to independence.

Obtaining a $100(1-\alpha)\%$ confidence interval for the (population) relative risk $\pi_{1|1}/\pi_{1|2}$ based on $p_{1|1}/p_{1|2}$:
The problem here is that the distribution of $p_{1|1}/p_{1|2}$ is highly skewed unless our sample sizes are extremely large. So instead, we obtain a confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$. To derive the confidence interval, we use the delta method.

The delta method for a function of a random variable:
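Let $T_n$ be a statistic, depending on a sample of size n. For large samples, suppose $T_n$ is approximately normally distributed with mean $\theta$ and variance $\sigma^2/n$. Let g be a function that is differentiable at $\theta$ with $g'(\theta) \neq 0$.
Using a Taylor series expansion of $g(T_n)$ around $\theta$, we can write $g(T_n) = g(\theta) + (T_n - \theta)g'(\theta) + R_n$, where $\sqrt{n}R_n$ converges in probability to 0 as $n \to \infty$. Hence, for large samples, $g(T_n)$ is approximately normally distributed with mean $g(\theta)$ and variance $[g'(\theta)]^2\sigma^2/n$.
Now we want a confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$.
We start with the point estimator of $\log(\pi_{1|1}/\pi_{1|2})$, namely $\log(p_{1|1}/p_{1|2}) = \log p_{1|1} - \log p_{1|2}$. Applying the delta method with $g(t) = \log t$, so that $g'(t) = 1/t$, gives $\mathrm{Var}(\log p_{1|i}) \approx (1-\pi_{1|i})/(n_{i+}\pi_{1|i})$. Since $p_{1|1}$ and $p_{1|2}$ are independent, an approximate $100(1-\alpha)\%$ confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$ is
$\log(p_{1|1}/p_{1|2}) \pm z_{\alpha/2}\sqrt{(1-p_{1|1})/(n_{1+}p_{1|1}) + (1-p_{1|2})/(n_{2+}p_{1|2})}$.
For our example above, the 95% C.I. for $\log(\pi_{1|1}/\pi_{1|2})$ is $0.5976 \pm 1.96(0.1213)$, or approximately (0.3598, 0.8355).
Now taking antilogs, a 95% C.I. for the relative risk $\pi_{1|1}/\pi_{1|2}$ is (1.43, 2.31).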
This means that we are 95% confident in stating that, after 5 years, the proportion of MI cases for physicians taking a placebo every second day is between 1.43 and 2.31 times the proportion of MI cases for physicians taking a single aspirin every second day.
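The interval for the relative risk can be reproduced with a short Python sketch. It assumes the usual delta-method standard error for the log of a ratio of two independent sample proportions, $\sqrt{(1-p_1)/(n_1 p_1) + (1-p_2)/(n_2 p_2)}$, builds the interval on the log scale, and then exponentiates the endpoints. The function and argument names are illustrative.

```python
import math
from statistics import NormalDist

def relative_risk_ci(n11, n1, n21, n2, level=0.95):
    """Sample relative risk and a large-sample CI built on the log scale."""
    p1, p2 = n11 / n1, n21 / n2
    rr = p1 / p2
    # Delta method: Var(log p) is approximately (1 - p) / (n * p).
    se_log = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_rr = math.log(rr)
    return rr, se_log, (math.exp(log_rr - z * se_log),
                        math.exp(log_rr + z * se_log))

# Placebo: 189 MI out of 11034; aspirin: 104 MI out of 11037.
rr, se_log, (lower, upper) = relative_risk_ci(189, 11034, 104, 11037)
print(round(rr, 2), round(lower, 2), round(upper, 2))
```

Because the interval is symmetric on the log scale, it is not symmetric about the sample relative risk after taking antilogs.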
Note: Sometimes we might want to estimate the ratio of the “failure” probabilities, $\pi_{2|1}/\pi_{2|2}$, rather than the ratio of “success” probabilities, $\pi_{1|1}/\pi_{1|2}$.
Odds Ratio

Another measure of association in contingency tables is the odds ratio.
Consider again the physician example above. Within row 1, the odds that the response is in column 1 instead of column 2 is $\Omega_1 = \pi_{1|1}/\pi_{2|1}$.
Similarly, within row 2, the corresponding odds is $\Omega_2 = \pi_{1|2}/\pi_{2|2}$.
If $\Omega_i > 1$ then response 1 is more likely than response 2 in row i.
Within-row conditional distributions are identical iff $\Omega_1 = \Omega_2$.
$\theta = \Omega_1/\Omega_2 = \pi_{1|1}\pi_{2|2}/(\pi_{2|1}\pi_{1|2})$ is called the odds (or cross product) ratio.
$\theta = 1$ means the response is not affected by group.
We estimate the population odds ratio $\theta$ by the sample odds ratio $\hat{\theta} = n_{11}n_{22}/(n_{12}n_{21})$. For our data, $\hat{\theta} = (189)(10933)/[(10845)(104)] = 1.83$, meaning the odds of MI are 83% higher for physicians in the placebo group.
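To make the link between within-row odds and the cross product ratio concrete, the Python sketch below computes the sample odds of MI in each group of the aspirin table and checks that their ratio equals $n_{11}n_{22}/(n_{12}n_{21})$. The function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a probability p, i.e. p / (1 - p)."""
    return p / (1 - p)

def sample_odds_ratio(table):
    """Cross product ratio n11 * n22 / (n12 * n21) of a 2 x 2 table."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)

# Rows: placebo, aspirin; columns: MI, no MI.
table = [[189, 10845], [104, 10933]]
odds_placebo = odds(189 / 11034)   # equals 189 / 10845
odds_aspirin = odds(104 / 11037)   # equals 104 / 10933
theta_hat = sample_odds_ratio(table)
print(round(odds_placebo, 4), round(odds_aspirin, 4), round(theta_hat, 2))
```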
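A $100(1-\alpha)\%$ C.I. for the population odds ratio $\theta$:
Again, since the sampling distribution of $\hat{\theta}$ is highly skewed except for extremely large sample sizes, we first obtain a confidence interval for $\log\theta$. Using the delta method, the estimated standard error of $\log\hat{\theta}$ is $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$, so an approximate confidence interval for $\log\theta$ is $\log\hat{\theta} \pm z_{\alpha/2}\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$. For our example, this gives $0.6054 \pm 1.96(0.1228)$, or approximately (0.3647, 0.8462), and taking antilogs gives the 95% C.I. (1.44, 2.33) for $\theta$.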
This means we are 95% confident that, after 5 years, the odds of MI for physicians taking a placebo every second day are between 1.44 and 2.33 times the odds of MI for physicians taking aspirin.

Relationship between Odds Ratio and Relative Risk

Since
odds ratio $= \frac{\pi_{1|1}/(1-\pi_{1|1})}{\pi_{1|2}/(1-\pi_{1|2})} =$ relative risk $\times \frac{1-\pi_{1|2}}{1-\pi_{1|1}}$,
the odds ratio and the relative risk are similar when the probabilities of “success” for both groups (i.e. $\pi_{1|1}$ and $\pi_{1|2}$) are close to zero. (This happens for our physician example and, in general, for a rare condition.)
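Both the interval for the odds ratio and its closeness to the relative risk can be checked numerically. The Python sketch below assumes the standard large-sample standard error of the log sample odds ratio, $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$, and then verifies the identity odds ratio = relative risk × $(1-p_{1|2})/(1-p_{1|1})$ for the sample proportions. The function and variable names are illustrative.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(n11, n12, n21, n22, level=0.95):
    """Sample odds ratio and a large-sample CI built on the log scale."""
    theta_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_theta = math.log(theta_hat)
    return theta_hat, (math.exp(log_theta - z * se_log),
                       math.exp(log_theta + z * se_log))

# Rows: placebo, aspirin; columns: MI, no MI.
theta_hat, (lower, upper) = odds_ratio_ci(189, 10845, 104, 10933)

# Relationship to the relative risk for the sample proportions.
p1, p2 = 189 / 11034, 104 / 11037
rr = p1 / p2
factor = (1 - p2) / (1 - p1)   # close to 1 when both proportions are small
print(round(theta_hat, 2), round(lower, 2), round(upper, 2))
print(round(rr, 2), round(factor, 4), round(rr * factor, 2))
```

Here the correction factor is within about 1% of 1, which is why the odds ratio (1.83) and the relative risk (1.82) nearly coincide for these data.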
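SAS program for the physician example.

If the data is internal to the program:

/* One record per cell of the 2 x 2 table: group, response, cell count. */
/* The response labels must be spelled identically (Yes/No) so that     */
/* PROC FREQ treats them as the same level.                             */
data aspirin;
  input Group $ MI $ count;
  cards;
Placebo Yes 189
Placebo No 10845
Aspirin Yes 104
Aspirin No 10933
;
/* The value after "order" was lost in the source; order=data (keep the */
/* order in which levels appear in the data) is assumed here.           */
proc freq order=data;
  tables Group*MI / chisq expected cellchi2 nocol nopct measures;
  weight count;   /* each record represents count observations */
run;

If the data is external to the program:

data aspirin;
  infile 'k:/STAT5602/aspirin.txt';
  input Group $ MI $ count;
/* order=data is assumed, as above. */
proc freq order=data;
  tables Group*MI / chisq expected cellchi2 nocol nopct measures;
  weight count;
run;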