## K:\stat5602-2013\lectures\sept

**STAT5602**

___________________________________________________________________________

**Two-Way Contingency Tables**

**Joint, Marginal and Conditional Distributions**
Consider *X* and *Y*, two categorical response variables, with *X* having *I* levels and *Y* having *J* levels, and suppose we classify each item in a population using both of these variables.

Responses (*X*, *Y*) corresponding to a randomly chosen item from this population have a joint probability distribution. Let $\pi_{ij}$ denote the probability that *X* assumes its *i*th level and *Y* assumes its *j*th level.

Consider the *I x J* contingency table whose cell $(i, j)$ contains $\pi_{ij}$. The set $\{\pi_{ij}\}$ is the **joint distribution of *X* and *Y***; it defines the (bivariate) relationship between the two variables.

The **marginal distributions of *X* and *Y*** are the row and column totals (respectively), obtained by summing the joint probabilities. These are denoted by

$$\pi_{i+} = \sum_{j=1}^{J} \pi_{ij} \quad \text{and} \quad \pi_{+j} = \sum_{i=1}^{I} \pi_{ij}.$$

The marginal distributions represent single-variable information; they **do not refer to** association links between the two variables.

Usually the joint probabilities $\pi_{ij}$ are unknown, but they can be estimated by sampling.


*Example*: Consider a sample of 3566 individuals cross-classified by smoking status and sleep problems. This yields a **2 x 2 contingency table**.

Note: The *overall sample size is fixed* **but** row and column totals are not fixed. Thus this study corresponds to a multinomial sample with 4 outcomes. The maximum likelihood estimates (M.L.E.s) of the joint probabilities $\pi_{ij}$ are the sample proportions $p_{ij} = n_{ij}/n$, where $n_{ij}$ is the count in cell $(i, j)$ and $n = 3566$ is the overall sample size.
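As a minimal sketch of the multinomial estimates described above, the following Python snippet computes the sample proportions $p_{ij} = n_{ij}/n$ and the estimated marginal distributions. The counts are hypothetical, chosen only for illustration; they are not the smoking and sleep data from the lecture.

```python
# Hypothetical 2 x 2 counts (NOT the lecture's data): rows = levels of X, columns = levels of Y.
counts = [[30, 70],
          [20, 80]]

# Overall sample size n (fixed under multinomial sampling).
n = sum(sum(row) for row in counts)

# Multinomial MLEs of the joint probabilities: p_ij = n_ij / n.
p = [[n_ij / n for n_ij in row] for row in counts]

# Estimated marginal distributions: row sums and column sums of the p_ij.
p_row = [sum(row) for row in p]
p_col = [sum(col) for col in zip(*p)]

print("joint:", p)
print("row marginals:", p_row)
print("column marginals:", p_col)
```

The estimated joint proportions sum to 1, as do each of the two estimated marginal distributions.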

Sometimes one variable can be thought of as a *response* variable and the other as an *explanatory* variable. (In this study, we might treat sleep problems as a response variable and smoking status as an explanatory variable.) For such cases, it may be useful to construct a separate probability distribution for *Y* at each level of *X*. Given that an item is classified in row *i* of *X*, let $\pi_{j|i}$ denote the probability of classification in column *j* of *Y*. The probabilities $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$ represent the **conditional** distribution of *Y* at the *i*th level of *X*. The conditional distribution of *Y* given *X* is related to the joint distribution of *X* and *Y* by

$$\pi_{j|i} = \frac{\pi_{ij}}{\pi_{i+}}, \qquad \text{where } \pi_{i+} = \sum_{j} \pi_{ij}.$$

Usually, these conditional probability distributions are also unknown but they can be estimated by sampling.

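For our example, we estimate the conditional probability distribution for sleep problems at the *i*th level of smoking status using $p_{j|i} = n_{ij}/n_{i+}$, where $n_{ij}$ is the count in cell $(i, j)$ and $n_{i+} = \sum_j n_{ij}$ is the total for row *i*.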
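The estimate $p_{j|i} = n_{ij}/n_{i+}$ can be sketched in a few lines of Python. The counts below are hypothetical and serve only to show the arithmetic; the lecture's own table is not reproduced here.

```python
# Hypothetical 2 x 2 counts (NOT the lecture's data): rows = explanatory X, columns = response Y.
counts = [[30, 70],
          [20, 80]]

# Row totals n_{i+}.
row_totals = [sum(row) for row in counts]

# Estimated conditional distribution of Y within each row: p_{j|i} = n_ij / n_{i+}.
cond = [[n_ij / n_i for n_ij in row] for row, n_i in zip(counts, row_totals)]

for i, dist in enumerate(cond, start=1):
    print(f"row {i}: {dist}")
```

Each row of `cond` is a probability distribution in its own right, so each row sums to 1.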


**Independence**

When both variables are *response* variables, we can describe their association using:

- their *joint* distribution,
- the *conditional distribution of Y given X*, or
- the *conditional distribution of X given Y*.

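The variables *X* and *Y* are **statistically independent** if

$$\pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for all } i \text{ and } j,$$

where $\pi_{i+}$ and $\pi_{+j}$ are the row and column marginal probabilities. Thus, for *X* and *Y* independent,

$$\pi_{j|i} = \frac{\pi_{ij}}{\pi_{i+}} = \pi_{+j} \quad \text{for all } i.$$

When *Y* is a *response* and *X* is an *explanatory* variable, the condition $\pi_{j|1} = \pi_{j|2} = \cdots = \pi_{j|I}$ for all *j* is a more natural definition of independence.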
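To make the definition concrete, here is a small Python sketch that checks whether a joint distribution factors into the product of its marginals. The function name `is_independent`, the tolerance, and both example tables are illustrative assumptions, not part of the lecture.

```python
def is_independent(joint, tol=1e-12):
    """Return True if every cell equals (row marginal) x (column marginal) within tol."""
    row = [sum(r) for r in joint]          # row marginals
    col = [sum(c) for c in zip(*joint)]    # column marginals
    return all(
        abs(joint[i][j] - row[i] * col[j]) <= tol
        for i in range(len(row))
        for j in range(len(col))
    )

# Hypothetical joint distributions, for illustration only.
independent_table = [[0.12, 0.28],   # outer product of (0.4, 0.6) and (0.3, 0.7)
                     [0.18, 0.42]]
dependent_table = [[0.20, 0.20],     # same marginals, different association
                   [0.10, 0.50]]

print(is_independent(independent_table))
print(is_independent(dependent_table))
```

Both tables share the same marginal distributions, which illustrates that the marginals alone carry no information about association.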

**Note**: In some tables where *Y* is a response variable and *X* is an explanatory variable, *X* is **fixed** rather than random. Then the notion of a *joint* distribution for *X* and *Y* is no longer meaningful. However, for a fixed level of *X*, *Y* has a probability distribution. Thus we can consider the conditional distribution of *Y* for different fixed levels of *X*.

**Test for Homogeneity: Prospective Study**

*Example*:

The Physicians’ Health Study was a 5-year study testing whether regular intake of aspirin reduces mortality from cardiovascular disease. In this study, 22,071 physicians were randomly assigned either to a group that was to take one aspirin tablet every other day or to a group that was to take a placebo every other day. Of the 22,071 physicians, 11,034 were assigned to receive the placebo and 11,037 were assigned to receive aspirin. The study was blind - i.e. the physicians did not know which type of pill they were assigned to take. Of the 11,034 physicians taking the placebo, 189 suffered myocardial infarction (MI) over the course of the study (18 of which were fatal) while of the 11,037 taking aspirin, 104 suffered MI (5 of which were fatal). The results are summarized in the following 2 x 2 **contingency table**:

| Group | MI | No MI | Total |
|---|---|---|---|
| Placebo | 189 | 10,845 | 11,034 |
| Aspirin | 104 | 10,933 | 11,037 |
| Total | 293 | 21,778 | 22,071 |


Source: Preliminary report: Findings from the aspirin component of the ongoing Physicians’ Health Study, *N. Engl. J. Med.* 318:

**Question**:

Is the proportion of physicians taking a placebo who suffer MI the same as the proportion of physicians taking aspirin who suffer MI?

This is an example of a *prospective* study. (*Note*: In a prospective study, the *row totals are fixed*.)

(NOTE: This study is a **clinical trial**, since physicians are assigned to the placebo and aspirin groups by the investigators. Another type of prospective study is a **cohort study**, where the researchers do **not** assign individuals to groups. For example, to study the effect of smoking on MI, a researcher might select a sample of smokers independently of a sample of nonsmokers, but the researcher does not *assign* individuals to the smoking and nonsmoking groups.)

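Let

- $\pi_{1|1}$ = probability of suffering MI, given that the physician takes the placebo
- $\pi_{2|1}$ = probability of not suffering MI, given that the physician takes the placebo
- $\pi_{1|2}$ = probability of suffering MI, given that the physician takes aspirin
- $\pi_{2|2}$ = probability of not suffering MI, given that the physician takes aspirin

We test $H_0: \pi_{1|1} = \pi_{1|2}$ against $H_a: \pi_{1|1} \neq \pi_{1|2}$. Under $H_0$ the two rows share a common conditional distribution, so we estimate the common value of each $\pi_{j|i}$. This allows us to determine estimated expected frequencies $\hat{m}_{ij}$. Pearson’s Chi-square test statistic can then be used here.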


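With the row totals $n_{1+}$ and $n_{2+}$ fixed, the likelihood is a product of two independent binomial likelihoods, so the *kernel of the likelihood* is

$$\prod_{i=1}^{2}\prod_{j=1}^{2} \pi_{j|i}^{\,n_{ij}}.$$

The *log likelihood of the kernel* is

$$\sum_{i}\sum_{j} n_{ij}\log \pi_{j|i}.$$

Thus, **under** $H_0$, writing $\pi_j$ for the common value of $\pi_{j|1}$ and $\pi_{j|2}$, the log likelihood becomes $\sum_j n_{+j}\log \pi_j$, which is maximized (subject to $\pi_1 + \pi_2 = 1$) at $\hat{\pi}_j = n_{+j}/n$.

Using this, the estimated frequencies are

$$\hat{m}_{ij} = n_{i+}\,\hat{\pi}_j = \frac{n_{i+}\,n_{+j}}{n}.$$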


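We obtain *Pearson’s* $X^2$:

$$X^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} \approx 25.01,$$

where $\hat{m}_{ij} = n_{i+}n_{+j}/n$ are the estimated expected frequencies under $H_0$. Recall that for large samples, $X^2$ is approximately chi-squared distributed with $(I-1)(J-1) = 1$ degree of freedom. The *p*-value is approximately 0, so there is strong evidence against $H_0$.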
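The Pearson statistic for the aspirin data can be checked with a short Python sketch using only the standard library. The counts come from the example above; the variable names are illustrative. The *p*-value uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds for 1 degree of freedom only.

```python
import math

# Observed counts from the Physicians' Health Study example.
counts = [[189, 10845],   # Placebo: MI, no MI
          [104, 10933]]   # Aspirin: MI, no MI

row_tot = [sum(r) for r in counts]          # n_{i+} (fixed by design)
col_tot = [sum(c) for c in zip(*counts)]    # n_{+j}
n = sum(row_tot)

# Estimated expected frequencies under H0: m_ij = n_{i+} * n_{+j} / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Pearson's X^2 = sum over cells of (observed - expected)^2 / expected.
x2 = sum(
    (counts[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(x2 / 2))

print(f"X^2 = {x2:.2f}, p-value = {p_value:.2e}")
```

For tables larger than 2 x 2 the `erfc` shortcut no longer applies, and a general chi-squared tail function would be needed instead.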

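A *likelihood ratio Chi-square test* can **also** be used here.

First we maximize the likelihood under $H_0$; then we maximize the likelihood in the general case (under $H_0 \cup H_a$). We define $\Lambda$ as the ratio of these two maximized likelihoods.

For the test for homogeneity above, the kernel is

$$\prod_{i}\prod_{j} \pi_{j|i}^{\,n_{ij}}.$$

Recall that, when $H_0$ is assumed to be true, the kernel simply becomes $\prod_{j} \pi_j^{\,n_{+j}}$, where $\pi_j$ is the common value of $\pi_{j|1}$ and $\pi_{j|2}$, and the log likelihood of this kernel is maximized at $\hat{\pi}_j = n_{+j}/n$.

Consider now the kernel *in the general context*. Its log likelihood is maximized at $p_{j|i} = n_{ij}/n_{i+}$. Hence

$$\Lambda = \frac{\prod_{i}\prod_{j} (n_{+j}/n)^{n_{ij}}}{\prod_{i}\prod_{j} (n_{ij}/n_{i+})^{n_{ij}}} = \prod_{i}\prod_{j} \left(\frac{\hat{m}_{ij}}{n_{ij}}\right)^{n_{ij}}, \qquad \hat{m}_{ij} = \frac{n_{i+}\,n_{+j}}{n},$$

and the test statistic is $G^2 = -2\log\Lambda = 2\sum_{i}\sum_{j} n_{ij}\log(n_{ij}/\hat{m}_{ij})$. Large values of $G^2$ give evidence against $H_0$.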


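For our example above, $G^2$ (called Wilks’ statistic) is:

$$G^2 = 2\sum_{i}\sum_{j} n_{ij}\log\frac{n_{ij}}{\hat{m}_{ij}} \approx 25.37,$$

where $\hat{m}_{ij} = n_{i+}n_{+j}/n$ and the degrees of freedom are 1 (same as for Pearson’s Chi-square test), with *p*-value of approximately 0, again concluding that there is strong evidence against $H_0$.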
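A minimal Python sketch of the likelihood ratio statistic for the same table follows, assuming the usual form $G^2 = 2\sum n_{ij}\log(n_{ij}/\hat{m}_{ij})$ with natural logarithms. Variable names are illustrative.

```python
import math

# Observed counts from the Physicians' Health Study example.
counts = [[189, 10845],   # Placebo: MI, no MI
          [104, 10933]]   # Aspirin: MI, no MI

row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]
n = sum(row_tot)

# Estimated expected frequencies under H0 (homogeneity).
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Likelihood ratio statistic G^2 = 2 * sum n_ij * log(n_ij / m_ij).
g2 = 2 * sum(
    counts[i][j] * math.log(counts[i][j] / expected[i][j])
    for i in range(2)
    for j in range(2)
)

# Upper-tail chi-squared probability, valid for 1 degree of freedom.
p_value = math.erfc(math.sqrt(g2 / 2))

print(f"G^2 = {g2:.2f}, p-value = {p_value:.2e}")
```

With a sample this large, $G^2$ and Pearson's $X^2$ are close, as expected since both have the same large-sample chi-squared distribution under $H_0$.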

Now we try to understand the nature of this difference in proportions of physicians taking aspirin who suffer MI and those physicians taking a placebo who suffer MI. We do this by examining **confidence intervals**, **relative risk**, and **odds ratios**.


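**Large Sample Confidence Interval for $\pi_{1|1} - \pi_{1|2}$**

We showed that the MLEs of $\pi_{1|1}$ and $\pi_{1|2}$ were

$$p_{1|1} = \frac{n_{11}}{n_{1+}} \quad \text{and} \quad p_{1|2} = \frac{n_{21}}{n_{2+}},$$

where the row totals $n_{1+}$ and $n_{2+}$ are fixed. Also, $n_{11}$ and $n_{21}$ are **independent binomial random** variables with means and variances

$$E(n_{i1}) = n_{i+}\,\pi_{1|i}, \qquad \operatorname{Var}(n_{i1}) = n_{i+}\,\pi_{1|i}(1 - \pi_{1|i}), \qquad i = 1, 2.$$

Consequently $p_{1|1}$ and $p_{1|2}$ are independent with means and variances

$$E(p_{1|i}) = \pi_{1|i}, \qquad \operatorname{Var}(p_{1|i}) = \frac{\pi_{1|i}(1 - \pi_{1|i})}{n_{i+}}, \qquad i = 1, 2.$$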


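For large samples, we can use the fact that the sample proportions $p_{1|1}$ and $p_{1|2}$ will be *approximately* normally distributed. Thus a $100(1-\alpha)\%$ confidence interval for $\pi_{1|1} - \pi_{1|2}$ is

$$(p_{1|1} - p_{1|2}) \pm z_{\alpha/2}\sqrt{\frac{p_{1|1}(1 - p_{1|1})}{n_{1+}} + \frac{p_{1|2}(1 - p_{1|2})}{n_{2+}}},$$

where $n_{1+}$ and $n_{2+}$ are the row totals.

For our example, to obtain a 95% confidence interval for $\pi_{1|1} - \pi_{1|2}$ we compute $p_{1|1} = 189/11034 \approx 0.0171$ and $p_{1|2} = 104/11037 \approx 0.0094$, with estimated standard error approximately 0.0015, and thus a 95% confidence interval for $\pi_{1|1} - \pi_{1|2}$ is

$$0.0077 \pm 1.96\,(0.0015) \approx (0.005,\ 0.011).$$

Noting that the interval does not contain 0, this indicates that aspirin appears to diminish the risk of MI.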
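The interval for the difference in proportions can be reproduced with the following Python sketch. It assumes the large-sample (Wald) form of the interval with $z_{0.025} = 1.96$; variable names are illustrative.

```python
import math

# Counts from the Physicians' Health Study example.
n11, n1 = 189, 11034   # MI cases and group size, placebo
n21, n2 = 104, 11037   # MI cases and group size, aspirin

p1 = n11 / n1          # sample proportion of MI, placebo
p2 = n21 / n2          # sample proportion of MI, aspirin
diff = p1 - p2

# Estimated standard error of the difference of two independent proportions.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = 1.96               # approximate 97.5th percentile of the standard normal
lower, upper = diff - z * se, diff + z * se

print(f"difference = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The interval lies entirely above 0, in line with the conclusion drawn from the chi-squared tests.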


**Relative Risk**

A difference between two proportions may have greater importance when both proportions are near 0 or 1 than when they are near the middle. So, instead of studying the effect of aspirin on MI by considering the difference $\pi_{1|1} - \pi_{1|2}$, where $\pi_{1|i}$ is the probability of MI in row *i*, we could look at the *relative risk*,

$$\frac{\pi_{1|1}}{\pi_{1|2}},$$

which is the ratio of the "success" probabilities for the 2 groups. In this case, "success" represents having MI.

Under independence, $\pi_{1|1} = \pi_{1|2}$ (i.e. the response is not affected by the group) so the relative risk equals 1.

We use the sample relative risk $p_{1|1}/p_{1|2}$ to estimate the population relative risk. For our data, we obtain $p_{1|1}/p_{1|2} = 0.0171/0.0094 \approx 1.82$. This implies that the sample proportion of MI cases was 82% higher for the group taking the placebo.

Note that a relative risk of 1.0 corresponds to independence.

**Obtaining a $100(1-\alpha)\%$ confidence interval for the (population) relative risk $\pi_{1|1}/\pi_{1|2}$ based on $p_{1|1}/p_{1|2}$**:

The problem here is that the distribution of the sample relative risk $p_{1|1}/p_{1|2}$ is *highly skewed* unless our sample sizes are extremely large. So instead, we obtain a confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$.

To derive the confidence interval, we use the **delta method**.


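**The delta method for a function of a random variable**:

Let $T_n$ be a statistic, depending on a sample of size *n*. For large samples, suppose $T_n$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$, and let $g$ be a function that is differentiable at $\mu$ with $g'(\mu) \neq 0$. Using a *Taylor series expansion* of $g(T_n)$ around $\mu$, we can write

$$g(T_n) = g(\mu) + (T_n - \mu)\,g'(\mu) + R_n,$$

where $\sqrt{n}\,R_n$ converges in probability to 0 as $n \to \infty$. Hence, for large samples, $g(T_n)$ is approximately normally distributed with mean $g(\mu)$ and variance $[g'(\mu)]^2\sigma^2/n$.

Now we want a confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$. We start with the point estimator of $\log(\pi_{1|1}/\pi_{1|2})$, namely $\log(p_{1|1}/p_{1|2}) = \log p_{1|1} - \log p_{1|2}$. Applying the delta method with $g = \log$ to each sample proportion gives $\operatorname{Var}(\log p_{1|i}) \approx (1 - \pi_{1|i})/(n_{i+}\pi_{1|i})$, and since the two samples are independent, the estimated standard error is

$$\hat{\sigma}\big(\log(p_{1|1}/p_{1|2})\big) = \sqrt{\frac{1 - p_{1|1}}{n_{11}} + \frac{1 - p_{1|2}}{n_{21}}}.$$

A $100(1-\alpha)\%$ confidence interval for $\log(\pi_{1|1}/\pi_{1|2})$ is then $\log(p_{1|1}/p_{1|2}) \pm z_{\alpha/2}\,\hat{\sigma}$.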

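For our example above, the 95% *C.I.* for $\log(\pi_{1|1}/\pi_{1|2})$ is

$$\log(p_{1|1}/p_{1|2}) \pm 1.96\,\hat{\sigma} \approx 0.598 \pm 1.96\,(0.1213) \approx (0.360,\ 0.836).$$

Now taking antilogs, a 95% *C.I.* for the relative risk $\pi_{1|1}/\pi_{1|2}$ is approximately $(1.43,\ 2.31)$.

This means that we are 95% confident in stating that, after 5 years, the proportion of MI cases for physicians taking a placebo every second day is between 1.43 and 2.31 times the proportion of MI cases for physicians taking a single aspirin every second day.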
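The relative risk interval can be reproduced with a short Python sketch. It assumes the delta-method standard error $\sqrt{(1-p_{1|1})/n_{11} + (1-p_{1|2})/n_{21}}$ on the log scale and $z_{0.025} = 1.96$; variable names are illustrative.

```python
import math

# Counts from the Physicians' Health Study example.
n11, n1 = 189, 11034   # MI cases and group size, placebo
n21, n2 = 104, 11037   # MI cases and group size, aspirin

p1 = n11 / n1
p2 = n21 / n2

rr = p1 / p2                       # sample relative risk
log_rr = math.log(rr)

# Delta-method standard error of log(p1 / p2).
se_log_rr = math.sqrt((1 - p1) / n11 + (1 - p2) / n21)

z = 1.96
# Build the interval on the log scale, then take antilogs.
lower = math.exp(log_rr - z * se_log_rr)
upper = math.exp(log_rr + z * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric about the sample relative risk.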

Note: Sometimes we might want to estimate the ratio of the "failure" probabilities, $(1 - \pi_{1|1})/(1 - \pi_{1|2})$, rather than the ratio of "success" probabilities, $\pi_{1|1}/\pi_{1|2}$.


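**Odds Ratio**

Another measure of association in contingency tables is the *odds ratio*.

Consider again the physician example above. Within row 1, the odds that the response is in column 1 instead of column 2 is

$$\Omega_1 = \frac{\pi_{1|1}}{\pi_{2|1}}.$$

Similarly within row 2, the corresponding odds is

$$\Omega_2 = \frac{\pi_{1|2}}{\pi_{2|2}}.$$

If $\Omega_i > 1$ then response 1 is more likely than response 2 in row *i*.

Within-row conditional distributions are identical *iff* $\Omega_1 = \Omega_2$. The ratio

$$\theta = \frac{\Omega_1}{\Omega_2} = \frac{\pi_{1|1}\,\pi_{2|2}}{\pi_{2|1}\,\pi_{1|2}}$$

is called the **odds (or cross product) ratio**. When $\theta = 1$, the response is not affected by group.

We estimate the *population odds ratio* by the sample odds ratio

$$\hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}} = \frac{189 \times 10933}{10845 \times 104} \approx 1.83,$$

meaning the odds of MI are 83% higher for physicians in the placebo group.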


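**Obtaining a $100(1-\alpha)\%$ *C.I.* for the population odds ratio**: again, since the sampling distribution of the sample odds ratio $\hat{\theta} = n_{11}n_{22}/(n_{12}n_{21})$ is *highly skewed* except for extremely large sample sizes, we first obtain a confidence interval for $\log\theta$, where $\theta$ is the population odds ratio. By the delta method, the estimated standard error of $\log\hat{\theta}$ is

$$\hat{\sigma}(\log\hat{\theta}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}},$$

so a $100(1-\alpha)\%$ confidence interval for $\log\theta$ is $\log\hat{\theta} \pm z_{\alpha/2}\,\hat{\sigma}(\log\hat{\theta})$, and taking antilogs gives the interval for $\theta$. For our example, $\log\hat{\theta} \approx 0.605$ and $\hat{\sigma}(\log\hat{\theta}) \approx 0.123$, so the 95% interval for $\log\theta$ is approximately $(0.365,\ 0.846)$ and the 95% interval for $\theta$ is approximately $(1.44,\ 2.33)$.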


This means we are 95% confident that, after 5 years, the odds of MI for physicians taking a placebo every second day is between 1.44 and 2.33 times the odds of MI for physicians taking aspirin.
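The odds ratio interval can be checked with the following Python sketch. It assumes the standard large-sample standard error $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$ for the log odds ratio and $z_{0.025} = 1.96$; variable names are illustrative.

```python
import math

# Cell counts from the Physicians' Health Study example.
n11, n12 = 189, 10845   # Placebo: MI, no MI
n21, n22 = 104, 10933   # Aspirin: MI, no MI

# Sample odds ratio (cross-product ratio).
odds_ratio = (n11 * n22) / (n12 * n21)
log_or = math.log(odds_ratio)

# Large-sample standard error of the log odds ratio.
se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

z = 1.96
# Interval on the log scale, then antilogs.
lower = math.exp(log_or - z * se_log_or)
upper = math.exp(log_or + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval excludes 1, the value of the odds ratio under independence.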

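**Relationship between Odds Ratio and Relative Risk**

Since

$$\text{odds ratio} = \frac{\pi_{1|1}/(1 - \pi_{1|1})}{\pi_{1|2}/(1 - \pi_{1|2})} = \text{relative risk} \times \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}},$$

the two measures differ only by the factor $(1 - \pi_{1|2})/(1 - \pi_{1|1})$. So when the probabilities of "success" for both groups (i.e. $\pi_{1|1}$ and $\pi_{1|2}$) are close to zero, the odds ratio and the relative risk are similar. (This happens for our physician example and, in general, for a *rare condition*.)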
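A quick numerical check of this relationship on the aspirin data, as a Python sketch with illustrative variable names:

```python
# Sample proportions of MI from the Physicians' Health Study example.
p1 = 189 / 11034   # placebo
p2 = 104 / 11037   # aspirin

relative_risk = p1 / p2
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

# Factor linking the two measures: odds ratio = relative risk * (1 - p2) / (1 - p1).
factor = (1 - p2) / (1 - p1)

print(f"RR = {relative_risk:.3f}, OR = {odds_ratio:.3f}, factor = {factor:.4f}")
```

Because both proportions are below 2%, the factor is very close to 1 and the two measures nearly coincide.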


**SAS program for the physician example**.

*If the data is internal to the program*:

    /* Read the 2 x 2 table as one record per cell */
    data aspirin;
    input Group $ MI $ count;
    cards;
    Placebo Yes 189
    Placebo No 10845
    Aspirin Yes 104
    Aspirin No 10933
    ;
    /* order=data (assumed) keeps rows and columns in the order they appear in the data */
    proc freq order=data;
    tables Group*MI / chisq expected cellchi2 nocol nopct measures;
    weight count;
    run;

*If the data is external to the program*:

    /* Same analysis, reading the cell counts from a text file */
    data aspirin;
    infile 'k:/STAT5602/aspirin.txt';
    input Group $ MI $ count;
    proc freq order=data;
    tables Group*MI / chisq expected cellchi2 nocol nopct measures;
    weight count;
    run;

Source: http://mathstat.carleton.ca/~smills/2013-14/STAT5602/sept17-19.pdf
