BIOL 3110

Biostatistics

Phil Ganter

301 Harned Hall

963-5782


Analysis of Categorical Data

Chapter 10


Unit Organization:

Problems:

Problems for homework (assume a 10. in front of each)

  • 3, 6, 13, 17, 21, 29, 31, 48, 52, 55, 65, 68, 70, 74, 88

Suggested Problems

  • 40 (Fisher's), 42 (Fisher's)

Go back to the first lecture and brush up on data types.

In this lecture, we will cover analysis of categorical data.

Categorical data is data that is sorted into different qualitative categories, not by a measured value.

Sex, color, and genotype are examples of categorical data. Each observation fits into one category (male or female; red, green or blue; AA, Aa, or aa).

Test for Goodness-of-Fit

A goodness-of-fit test tests whether or not the data conform to some prior expectation for the data.

The prior expectation can come from a model or from previous experience or can be a null expectation.

Hardy-Weinberg predicts that one should find p² of AA, 2pq of Aa, and q² of aa individuals in a population, if p and q are the frequencies of A and a, respectively.

One might ask if the actual frequencies of genotypes from a population agree with Hardy-Weinberg expectations.

This is a good case of prior expectations coming from a model.
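As a minimal sketch of how a model supplies the expected values, the snippet below computes Hardy-Weinberg expected genotype counts. The genotype counts are hypothetical (they are not data from this lecture), and the function name is only an illustrative choice.

```python
def hardy_weinberg_expected(n_AA, n_Aa, n_aa):
    """Return expected counts of AA, Aa and aa under Hardy-Weinberg.

    The allele frequency p is estimated from the observed genotype counts.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    return [p * p * n, 2 * p * q * n, q * q * n]


# Hypothetical sample of 100 individuals (not data from the lecture).
observed = [30, 40, 30]
expected = hardy_weinberg_expected(*observed)
print(expected)
```

The expected counts always sum to the number of individuals sampled, so they can be compared category by category with the observed counts.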

If you have taken Ecology, then you have used a model, the Poisson distribution, to predict the distribution of plots with a particular number of trees. This is another example of using a model to predict proportions of outcomes within each category.

From my own work, I found that some populations of a pillbug (Armadillidium vulgare) had 50% females and other populations had 85% females. If I sample a new population and find 60% females, does this proportion agree with either the 50% or the 85% expectation?

This is an example of prior expectations coming from prior experience.

There are three morphs of a water flea (Daphnia). An experimenter puts equal numbers of each morph into a tank and then lets a predator (a fish) prey on them for a standard length of time. The null hypothesis of no difference in predation rate would lead to an unchanged proportion of each prey morph after being preyed upon by the fish.

This is an example of a null expectation.

The way to test for goodness-of-fit is to use a Chi-square (χ²) test.

The test is based on measuring the deviation of the data from the expected outcome.

First, you must calculate the expected outcome, which is dependent on the circumstances of the test.

Then you subtract the expected outcomes from the observed outcomes.

If you were to sum these deviations, they would sum to 0 (this is an outcome of the fact that both the observed and expected columns must sum to n, the number of observations).

To avoid this, we have to get rid of the - sign somehow.

We could take the absolute value, which we will not use, or we could square all of the deviations, so that negative values become positive.

Square all deviations.

There is another problem. The size of the sum of deviations will depend on the size of n. We partially correct this by standardizing the deviation.

Divide each deviation by the expected value used to calculate that deviation.

Sum all standardized deviations

This sum is the χ² value:

$$\chi^2 = \sum_{i=1}^{c} \frac{(o_i - e_i)^2}{e_i}$$

where o = number of observations in a category, e = expected number of observations in a category, i is the index, and c is the number of categories.
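The steps above can be followed one at a time in a short sketch. The counts are hypothetical (three categories with an equal expectation of 30 each); they are not data from the lecture.

```python
observed = [28, 35, 27]          # hypothetical counts, not data from the lecture
expected = [30.0, 30.0, 30.0]    # equal expectation in each category

# Step 1: subtract expected from observed. These deviations sum to 0.
deviations = [o - e for o, e in zip(observed, expected)]

# Step 2: square each deviation so that negative values become positive.
squared = [d ** 2 for d in deviations]

# Step 3: standardize by dividing each squared deviation by its expected value.
standardized = [s / e for s, e in zip(squared, expected)]

# Step 4: sum the standardized deviations to get the chi-square value.
chi_square = sum(standardized)

print(sum(deviations), chi_square)
```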

Evaluation of the χ² value

First, select an α-value. This is a necessary step for any statistical test.

The χ²-values are not normally distributed.

The distribution begins at 0 (the outcome when all observed are the same as expected values).

The right side is skewed so that the right tail extends to infinity.

The exact shape depends on the number of categories and the assumption that the deviations between the expected and observed values are due to random error only.

Think for a moment about how the χ²-value is calculated and the distribution will make some sense.

As the χ²-value increases, it means that the gap between the observed and expected values is increasing.

If the gap is due to random error only, then really large values of χ² must be uncommon (low probability).

The tail to the right represents the probability of getting a χ²-value equal to or larger than the calculated χ²-value.

Thus, the tail probability is a measure of how unlikely a χ²-value as large as the calculated value is: it is the probability of getting a χ²-value at least this large if the deviations are due to random error alone.

If this probability is too low, then you are forced to reject the idea that it is due to random error alone and accept that something else is contributing to the gap between observed and expected values.

You are forced to conclude that your expected values are not the correct values and your model is incorrect.

The null and alternative hypotheses are:

H0 : The χ²-value is due to random error and the expected values are an accurate prediction of the observed values

HA : The χ²-value is too large to be due to random error alone and the expected values are not an accurate prediction of the observed values

The question you must answer is "How unlikely is my χ²-value?" This means you must look up the probability of a χ²-value as large or larger than this in the tables in the back of the book.

d. f. = the number of categories - 1

NOTE THAT THIS IS NOT THE NUMBER OF OBSERVATIONS - 1

Nondirectional vs. Directional

Goodness-of-fit null hypotheses are called COMPOUND NULL HYPOTHESES because each of the expected values (except the last, which is fixed by the values of the others since the total number of observations is fixed) is an independent hypothesis.

This fact means that the test must be non-directional, as the deviation for each of the independent nulls can be either + or -, so no overall direction is necessarily true for the entire test. We just know the observed didn't fit the expected very well.

The exception to this is when the outcome categories are just two (a dichotomous variable). Here, since only one of the expected values is not fixed, then directionality is possible. For an example, see the 2 x 2 contingency table below.

Reject the null if your χ² is larger than the Table 9 entry for the appropriate d. f. and the α value.

More about the test

What happens when you increase the number of observations but the percentage in each category does not change? Let's do it with flipping coins.

Here the model assumes a fair toss so that Pr{heads} = Pr{tails} = 0.5

We do the experiment twice, with 10 flips the first time and 100 flips the second. Each time, 70% of the flips were heads.

First Experiment
Category      Observed    Expected    (O-E)^2/E
Heads         7           5           0.8
Tails         3           5           0.8
Chi-square                            1.6

Second Experiment
Category      Observed    Expected    (O-E)^2/E
Heads         70          50          8
Tails         30          50          8
Chi-square                            16

Look at the difference in the χ²-values. One is ten times the other. This will greatly affect your conclusions (with an α value of 0.001 and d. f. = 1, a very conservative test, in only the second case will the null be rejected).

Is this the way it should be?

Definitely. The chance of being 20% off of the expected number of heads for only 10 flips with an honest coin is much greater than being 20% off with 100 flips. The first case is only 2 over the expected and the second case has 20 too many heads. Random error could easily lead to the first outcome but it's not very likely to produce the second outcome.
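The two coin experiments can be checked with a short sketch. It relies on one fact not stated in the lecture: with d. f. = 1, the right-tail probability of a χ²-value x equals erfc(sqrt(x/2)), so no table lookup is needed. The function names are illustrative.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def tail_probability_df1(chi_sq):
    """Pr{chi-square >= chi_sq} when d. f. = 1."""
    return math.erfc(math.sqrt(chi_sq / 2))


# 70% heads in both experiments; a fair coin predicts 50% heads.
chi_10 = chi_square_statistic([7, 3], [5, 5])
chi_100 = chi_square_statistic([70, 30], [50, 50])
p_10 = tail_probability_df1(chi_10)
p_100 = tail_probability_df1(chi_100)

print(chi_10, p_10)
print(chi_100, p_100)
```

Only the tail probability for the 100-flip experiment falls below an α of 0.001, which matches the conclusion above.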

Sample size is still an important matter, even if, in this case, the degrees of freedom do not depend on the sample size but on the categories involved.

2 x 2 Contingency Tables

CONTINGENCY TABLES are tables where one variable supplies all of the columns and another variable supplies all of the rows.

They are called contingency tables because we are investigating if one variable's outcome is contingent (= dependent) on the other variable.

A 2 X 2 CONTINGENCY TABLE (said 2-by-2) is a special case where each variable has only two possible outcomes.

Each of the combinations of the variables gets a CELL of its own.

To test whether or not one variable is affecting the other, we need to have an idea of what to expect if it has no effect. This means we need an expected value for each cell that we can calculate.

The assumption in a contingency table is that the proportion of observations in each category of Variable A should be the same, no matter which category of Variable B we are looking at.

How do we estimate this proportion?

We do this by using the MARGINAL TOTALS, the totals at the margins of the table below:

                 Variable A
Variable B       Category A1    Category A2    Margin
Category B1      21             4              25
Category B2      8              32             40
Margin           29             36             65 (Total #)

The marginal totals for Variable B (25 and 40) divided by the grand total (65) give us our estimate of the frequency of categories B1 and B2 independent of Variable A.

The marginal totals for Variable A (29 and 36) divided by the grand total (65) give us our estimate of the frequency of categories A1 and A2 independent of Variable B.

So we can use these marginals to get the expected proportion of observations in each of the four cells:

(29/65) * (25/65) = 0.17

(29/65) * (40/65) = 0.27

(36/65) * (25/65) = 0.21

(36/65) * (40/65) = 0.34

The expected outcomes are then (simply multiply the proportions by the total, 65):

Expected Outcomes
                 Variable A
Variable B       Category A1    Category A2
Category B1      11.15          13.85
Category B2      17.85          22.15
Total                                          65.00

You now have a set of observations and a set of expectations from which to calculate a χ². In this case it is 25.5.
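A minimal sketch of this calculation, using the observed 2 x 2 table above. The function names are illustrative; the expected count for each cell is computed as (row margin * column margin) / total, which is the same as multiplying the two marginal proportions by the total.

```python
def expected_counts(table):
    """Expected cell counts from the marginal totals of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_table(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows are Categories B1 and B2; columns are Categories A1 and A2.
observed = [[21, 4], [8, 32]]
expected = expected_counts(observed)
chi_sq = chi_square_table(observed)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_sq, 1))
```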

Evaluation of the χ² value

First, select an α-value. This is a necessary step for any statistical test.

The degrees of freedom are the number of rows -1 (r - 1) times the number of columns -1 (c - 1)

(r - 1) * (c - 1) = 1 * 1 = 1

The null and nondirectional alternative hypotheses are:

H0 : There is no dependency between Variable A and B

HA : The observed values of Variables A and B are interdependent

Reject the null if your χ² is larger than the Table 9 entry for the appropriate d. f. and the α value.

The directional alternative hypotheses are:

HA1 : A greater proportion of Category A1 is in Category B1 than Category A2

or

HA2 : A lesser proportion of Category A1 is in Category B1 than Category A2

First you have to check to see if the alternative has occurred.

If we choose HA1 above, then we would proceed because 21 of 29 A1 outcomes were in B1 but only 4 out of 36 A2 outcomes were in B1, and this is as predicted by HA1.

If we choose HA2 above, then we would not proceed because 21 of 29 A1 outcomes were in B1 but only 4 out of 36 A2 outcomes were in B1, and this is not as predicted by HA2.

Reject the null if your χ² is larger than the Table 9 entry for the appropriate d. f. and the α*2 value.

Notice that we have to look up a value twice the α-value, which makes the cutoff χ²-value smaller and, so, a smaller deviation will allow one to reject the null hypothesis.
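The directional procedure can be sketched as follows. Two things here are assumptions of the sketch rather than part of the lecture: the χ² is computed with the standard 2 x 2 shortcut formula (it gives the same value as summing (O-E)^2/E over the four cells), and instead of looking up the α*2 column the nondirectional tail probability is halved, which amounts to the same comparison. Returning None when the data deviate in the wrong direction is an illustrative convention.

```python
import math


def directional_2x2_p_value(table):
    """One-tailed p-value for HA: a greater proportion of A1 than of A2 is in B1.

    Rows are B1 and B2; columns are A1 and A2.
    Returns None if the data do not deviate in the direction predicted by HA.
    """
    (a, b), (c, d) = table
    prop_a1_in_b1 = a / (a + c)
    prop_a2_in_b1 = b / (b + d)
    if prop_a1_in_b1 <= prop_a2_in_b1:
        return None  # wrong direction: do not proceed, do not reject H0

    n = a + b + c + d
    chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    nondirectional_p = math.erfc(math.sqrt(chi_sq / 2))  # d. f. = 1
    return nondirectional_p / 2


p_ha1 = directional_2x2_p_value([[21, 4], [8, 32]])
p_wrong_direction = directional_2x2_p_value([[4, 21], [32, 8]])
print(p_ha1, p_wrong_direction)
```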

What have we tested here?

We have asked if the two variables are independent of one another or if they are associated.

If we accept the null, we are saying that Variable A and B are INDEPENDENT of one another.

The outcome of A does not depend on B and vice-versa.

If we accept the nondirectional alternative hypothesis, we are saying that variables A and B are ASSOCIATED.

Association means that the outcome of one corresponds to the outcome of the other.

In our example, if you get A1, then you also expect B1, but if you get A2, then you expect B2.

FISHER'S EXACT TEST

This is a test that is an alternative to the χ² test for contingency tables.

It is called exact because it gives the exact probability of getting the cell values given the marginal totals.

H0: The probability of infection is independent of the genotype of the plant.

HA: (directional) the probability of infection is lower for aa than for other genotypes.

Suppose that there are three genotypes but that the A allele is completely dominant. You think the aa genotype might be useful if it shows resistance masked by the dominant allele. So you set up an experiment to test this. Plots of plants are exposed to the fungal spores and the appearance of infected individuals is noted. Plots are monocultures of plant genotypes. The results:

             Genotype
Infected     AA or Aa    aa    Frequency    Margin
Yes          13          3     0.23         16
No           7           10    0.77         17
Margin       20          13                 33 (Total #)

Ways of getting 3 out of 16      560
Ways of getting 10 out of 17     19448
Ways of getting 13 out of 33     573166440
Probability                      0.019001

             Genotype
Infected     AA or Aa    aa    Frequency    Margin
Yes          14          2     0.15         16
No           6           11    0.85         17
Margin       20          13                 33 (Total #)

Ways of getting 2 out of 16      120
Ways of getting 11 out of 17     12376
Ways of getting 13 out of 33     573166440
Probability                      0.002591

             Genotype
Infected     AA or Aa    aa    Frequency    Margin
Yes          15          1     0.08         16
No           5           12    0.92         17
Margin       20          13                 33 (Total #)

Ways of getting 1 out of 16      16
Ways of getting 12 out of 17     6188
Ways of getting 13 out of 33     573166440
Probability                      0.000173

             Genotype
Infected     AA or Aa    aa    Frequency    Margin
Yes          16          0     0.00         16
No           4           13    1.00         17
Margin       20          13                 33 (Total #)

Ways of getting 0 out of 16      1
Ways of getting 13 out of 17     2380
Ways of getting 13 out of 33     573166440
Probability                      0.000004

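Only the top part of the table, with 3 infected in the aa column, represents the outcome of the experiment (the data in black).

The parts below it, with 2, 1, and 0 infected in the aa column (the data in maroon), represent hypothetical situations discussed below.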

We need to know how likely this table is, assuming that the marginal totals are fixed.

Remember that, since the marginal totals are unchanging, if we know the probability of the outcomes for one category of one variable, we know the outcomes in the other cells, so we need to find the probability of the outcome in a single cell.

The likelihood of the table depends on the number of ways to construct the table with the given cell entries divided by the total number of ways to get the marginal totals.

These "number of ways" are combinatorials, just like we worked with when learning the binomial.

nCj = n!/((j!)*(n-j)!)

The numerator is the product of the number of ways of getting 3 successes out of 16 trials (= 16!/((3!)*(13!)) = 560) times the number of ways to get 10 successes out of 17 trials (= 17!/((10!)*(7!)) = 19448). This is divided by the number of ways to get 13 out of 33 trials (= 33!/((13!)*(20!)) = 573166440), giving 0.0190.

But this ignores that there are situations which give more support for rejecting the null than the actual outcome of the experiment.

These are the situations in maroon above.

Each one represents an outcome that supports the directional HA , so we have to add the probability of these outcomes to the probability of the actual outcome.

Once this is done, we see that the probability of getting this outcome or one more in line with HA is:

Pr{this table or one more extreme} = 0.0190 + 0.00259 + 0.000173 + 0.000004 = 0.0218

If we have an alpha-value of 0.05, then we reject the null and accept the alternative.
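The directional calculation can be reproduced with a short sketch using Python's `math.comb` for the combinatorials. The function name and its default arguments (the fixed marginal totals 16, 17 and 13 from the table) are illustrative choices.

```python
from math import comb


def table_probability(aa_infected, infected_total=16, healthy_total=17, aa_total=13):
    """Probability of a table with the given number of infected aa,
    holding all of the marginal totals fixed."""
    ways = comb(infected_total, aa_infected) * comb(healthy_total, aa_total - aa_infected)
    return ways / comb(infected_total + healthy_total, aa_total)


# Observed table: 3 infected aa.
p_observed = table_probability(3)

# Directional HA: add the tables that support HA even more strongly (2, 1, 0).
p_directional = sum(table_probability(k) for k in range(0, 4))

print(round(p_observed, 6), round(p_directional, 4))
```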

Notice that this is a directional alternative. We will stop here and do the non-directional only if we have time.

Confidence Intervals for Differences between Probabilities in 2 x 2 tables

If you see the 2 x 2 table as two samples with two observations apiece, then you can ask if the probabilities differ by constructing a confidence interval for the difference between the probability of some outcome within each sample.

  Sample 1 Sample 2
Category 1 x1 x2
Category 2 n1 - x1 n2 - x2
   
Totals n1 n2

If we define n1 and n2 as the marginal totals of the columns (each column is a different sample), we can define p1 and p2 as the probability of getting category 1 in the two samples.

We can ask if this probability is the same in each sample.

Define $\tilde{p}_1 = \frac{x_1 + 1}{n_1 + 2}$ and $\tilde{p}_2 = \frac{x_2 + 1}{n_2 + 2}$.

Note that the addition of 1 to the numerator and 2 to the denominator is a correction for bias at small sample size.

What is needed is a standard error of the difference, SE, and a z value (which depends on the α-level you want for the confidence interval - we will use the 0.05 level recommended by the book, for which z = 1.96).

Then the confidence interval is:

$$(\tilde{p}_1 - \tilde{p}_2) \pm 1.96 \times SE$$

Notice that this is a way of doing a parametric test on categorical data.
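A minimal sketch of the interval follows. The lecture does not spell out the standard error, so the formula used here is an assumption: the usual form for a difference of two proportions, built from the adjusted proportions and the adjusted sample sizes n1 + 2 and n2 + 2. Check it against the book before relying on it. The example counts reuse the infection table from the Fisher's exact test section (13 of 20 AA or Aa infected, 3 of 13 aa infected).

```python
import math


def difference_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for p1 - p2 using the adjusted proportions.

    ASSUMPTION: the standard error below is not given in the lecture.
    """
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se


low, high = difference_ci(13, 20, 3, 13)
print(low, high)
```

If the interval excludes 0, the two samples differ in the probability of category 1 at the chosen level.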

r x k Contingency Tables

There is no difference between this procedure and the 2 x 2 we have already done.

We just have more cells for which we need to calculate expected numbers of outcomes and we have to do the χ² with more than 4 categories (r*c categories, to be exact).

An example will suffice:

We will expand the previous example to four rows by three columns.

Notice that the expected proportions still total to 1, and the observed and expected totals are still equal to one another.

Each expected proportion cell is still the product of the row and column marginal totals divided by the square of the total number of outcomes.

Actual Outcomes
                 Variable A
Variable B       Category A1    Category A2    Category A3    Margin
Category B1      21             17             14             52
Category B2      16             11             8              35
Category B3      7              5              4              16
Category B4      3              0              1              4
Margin           47             33             27             107 (Total #)

Expected Proportions
                 Variable A
Variable B       Category A1    Category A2    Category A3
Category B1      0.21           0.15           0.12
Category B2      0.14           0.10           0.08
Category B3      0.07           0.05           0.04
Category B4      0.02           0.01           0.01
Total                                                         1.00

Expected Outcomes
                 Variable A
Variable B       Category A1    Category A2    Category A3
Category B1      22.84          16.04          13.12
Category B2      15.37          10.79          8.83
Category B3      7.03           4.93           4.04
Category B4      1.76           1.23           1.01
Total                                                         107.00

Chi Square
                 Variable A
Variable B       Category A1    Category A2    Category A3
Category B1      0.1            0.1            0.1
Category B2      0.0            0.0            0.1
Category B3      0.0            0.0            0.0
Category B4      0.9            1.2            0.0
Total                                                         2.49

Pr{Greater chi-square} 0.87
df = (r-1)*(c-1) = 6

As you can see, there is no evidence that this table differs from the marginal expectations.

The χ² is only 2.49 and the probability of a χ² this large or larger is 0.87, far larger than the usual 0.05 α-level (the d. f. = (r-1)*(c-1) = (4 - 1)*(3 - 1) = 6).

Notice that each Variable A category declines from A1 to A3, no matter which Variable B category you are looking at.

The Variable A trends are INDEPENDENT of Variable B and the difference between the observed and the expected is due to random error.

What if that were not true? What if one of the Variable B categories bucked the trend?

In the table below, Variable A had the opposite trend in Category B3, increasing from A1 to A3.

By the way, the expected proportions have been skipped as I have used the formula (RowMarginal*ColumnMarginal)/Total to go straight to the expected number of outcomes.

Actual Outcomes
                 Variable A
Variable B       Category A1    Category A2    Category A3    Margin
Category B1      21             17             14             52
Category B2      16             11             8              35
Category B3      1              4              11             16
Category B4      3              0              1              4
Margin           41             32             34             107 (Total #)

Expected Outcomes
                 Variable A
Variable B       Category A1    Category A2    Category A3
Category B1      19.93          15.55          16.52
Category B2      13.41          10.47          11.12
Category B3      6.13           4.79           5.08
Category B4      1.53           1.20           1.27
Total                                                         107.00

Chi Square
                 Variable A
Variable B       Category A1    Category A2    Category A3
Category B1      0.1            0.1            0.4
Category B2      0.5            0.0            0.9
Category B3      4.3            0.1            6.9
Category B4      1.4            1.2            0.1
Total                                                         15.95

Pr{Greater chi-square} 0.01
df = (r-1)*(c-1) = 6

Look at the difference this change has made in the value of χ² (once again, the d. f. = (r-1)*(c-1) = (4 - 1)*(3 - 1) = 6). Now the Pr{greater χ²-value} is about 0.01, below the 0.05 α-level.

This means that the trend in Variable A DEPENDS on which category of Variable B.

Variables A and B are NOT INDEPENDENT.
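Both r x k tables can be checked with a short sketch that goes straight to the expected counts with (RowMarginal*ColumnMarginal)/Total. The tail-probability helper is an addition of this sketch: it uses the closed form of the χ² right tail that holds when the d. f. is even, which happens to cover d. f. = 6. Function names are illustrative.

```python
import math


def chi_square_rxk(table):
    """Return (chi-square, d. f.) for an r x k contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_sq, df


def chi_square_tail_even_df(chi_sq, df):
    """Pr{chi-square >= chi_sq}; valid only for an even d. f."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even d. f.")
    half = chi_sq / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


first = [[21, 17, 14], [16, 11, 8], [7, 5, 4], [3, 0, 1]]
second = [[21, 17, 14], [16, 11, 8], [1, 4, 11], [3, 0, 1]]

chi_first, df_first = chi_square_rxk(first)
chi_second, df_second = chi_square_rxk(second)
p_first = chi_square_tail_even_df(chi_first, df_first)
p_second = chi_square_tail_even_df(chi_second, df_second)

print(round(chi_first, 2), df_first, round(p_first, 2))
print(round(chi_second, 2), df_second, round(p_second, 2))
```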

Paired Data and 2 x 2 Tables

If your data is paired, it may be possible to use categorical analysis to understand the independence/dependence between outcomes for paired data. An example will illustrate.

A researcher wants to know about the probability of attack of a newly developed bean variety by a fungal pathogen. The data comes from plots of the beans planted by farmers throughout Tennessee. Either the plot is attacked by the fungus or it is not. Data is collected for two years from the same plots and is presented in the table below.

The pairing comes from the same plots being utilized each year so individual plots can affect two datapoints.

                          Second Year Infected?
                          yes      no
First year Infected?
  yes                     67       165
  no                      210      31

                          Second Year Infected?
                          yes      no
First year Infected?
  yes                     n11      n12
  no                      n21      n22

(n12 - n21)²     2025
n12 + n21        375
chi-square       5.4
d. f.            1
Probability      0.02
alpha-value      0.05
conclusion       reject null

 

n11 and n22 represent CONCORDANT pairs, those that did not have an infection either year or had it both years.

n12 and n21 represent DISCORDANT pairs, those that either developed the infection in the first year and lost it the second or developed it only in the second year.

H0 for this analysis is that the year did not make any difference in the probability of the plot of beans being attacked by the fungus.

H0 : a discordant pair is just as likely to be yes- no as no-yes.

H0 : Pr{yes-no} = Pr{no-yes} = 0.5

McNEMAR'S TEST

This test uses the χ² test for the expected 0.5 distribution and is calculated as

χ² = (n12 - n21)²/(n12 + n21), with 1 d. f.

In the case above, it appears that the years were not independent. Plots that were infected in only one year were more likely to be infected in the second year than in the first.
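A minimal sketch of McNemar's test on the discordant pairs from the bean example. The tail probability uses the d. f. = 1 identity Pr{χ² >= x} = erfc(sqrt(x/2)) in place of a table lookup; the function name is illustrative.

```python
import math


def mcnemar(n12, n21):
    """Return (chi-square, tail probability) from the two discordant counts."""
    chi_sq = (n12 - n21) ** 2 / (n12 + n21)
    p_value = math.erfc(math.sqrt(chi_sq / 2))  # d. f. = 1
    return chi_sq, p_value


# n12 = infected the first year only, n21 = infected the second year only.
chi_sq, p_value = mcnemar(165, 210)
print(chi_sq, round(p_value, 2))
```

Note that the concordant pairs (n11 and n22) do not enter the calculation at all.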

Relative Risk and the Odds Ratio

One often hears a reporter on the news saying something like:

"A study just published in the Journal of the American Medical Association reports that listening to pop music increases a person's risk of dermatitis three times."

You almost never hear scientists in areas other than clinical medicine report their findings in this same way, but clinical researchers often do.

How do they determine this?

They are reporting RELATIVE RISK, a ratio of probabilities.

In the example above, if 300 of 2000 participants in the study who listened to pop music suffered dermatitis during the study, then:

Pr{dermatitis | pop listening} = 300/2000 = 0.15

Note that the vertical line means "given" so that you read Pr{dermatitis | pop listening} as the probability of contracting dermatitis given that one listens to pop music.

Suppose that:

Pr{dermatitis | no pop listening} = 100/2000 = 0.05

If these are the probabilities, then we can calculate the relative risk of contracting dermatitis as the ratio of these probabilities, or

Relative Risk = 0.15 / 0.05 = 3

The ODDS of something happening is another ratio, the probability of something happening divided by the probability of it not happening.

What are the odds of contracting dermatitis for pop listeners?

Pr{dermatitis}/Pr{not getting dermatitis} = (300/2000)/(1700/2000) = 300/1700 = 3/17

For non-pop listeners?

Pr{dermatitis}/Pr{not getting dermatitis} = (100/2000)/(1900/2000) = 100/1900 = 1/19

The ODDS RATIO is the ratio of the two odds or:

(3/17)/(1/19) = (3*19)/(17*1) = 57/17 = 3.35
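Both ratios can be computed directly from the counts in the example. The function and parameter names below are illustrative.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the two conditional probabilities of the outcome."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


def odds(cases, n):
    """Probability of the outcome divided by the probability of no outcome."""
    return cases / (n - cases)


def odds_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the odds in the exposed group to the odds in the unexposed group."""
    return odds(cases_exposed, n_exposed) / odds(cases_unexposed, n_unexposed)


# 300 of 2000 pop listeners and 100 of 2000 non-listeners had dermatitis.
rr = relative_risk(300, 2000, 100, 2000)
or_ = odds_ratio(300, 2000, 100, 2000)
print(round(rr, 2), round(or_, 2))
```

The two ratios are close here because dermatitis is fairly uncommon in both groups; they drift apart as the outcome becomes more common.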

The book has more on the odds ratio, but we haven't the time to go further than this.

Last updated March 30, 2006