BIOL 3110
Biostatistics | Phil Ganter | 302 Harned Hall | 963-5782
Comparing 2 Independent Samples
Unit Organization:
Problems:
Problems for homework (Assume a 7. in front of each)
- 4, 5, 10, 19, 23, 30, 42, 44, 47, 51, 57, 64, 79, 82, 89, 96, 97
Suggested Problems (Assume a 7. in front of each)
- 1, 2, 11, 17, 18, 24, 27, 38, 46, 50, 54, 60, 66, 68, 83, 104
Independent Samples and the Difference Between Means
We often want to compare two different populations.
Control versus Experimental
Male versus Female
Old versus New
We do this by drawing random samples from each population and comparing the samples.
Populations must not overlap, so that the samples drawn are INDEPENDENT of one another.
Any population parameters may be compared, but we will, once again, concentrate on the mean as the best way to compare populations (this is not always true).
To do this, we will speak about a composite statistic (or parameter), called the DIFFERENCE BETWEEN MEANS
For populations this is: μ1 - μ2
For samples this is: x̄1 - x̄2
where the subscripts 1 and 2 identify the different populations or samples
Standard Error of the Difference Between Means
We will use the standard error of the mean, SE(x̄), to get the standard error of the difference between means, SE(x̄1 - x̄2)
There are two ways to approach this.
One pools the variance of each sample to get an overall variance and then calculates the standard error from this pooled variance
The second is called the unpooled SE and it uses SE from both samples
If the 2 sample standard deviations are equal or if the sample sizes are equal, then the pooled and unpooled SEs are equal
When the sample sizes are unequal, then we must choose which to use
If the standard deviations of the POPULATIONS are EQUAL, then the pooled is the correct choice
However, the unpooled SE will be close to the pooled SE
If the standard deviations of the POPULATIONS are UNEQUAL, then the unpooled is the correct choice
The book recommends that the unpooled be the only choice, because:
the potential problems caused by choosing the unpooled when the pooled is the correct choice are small and the unpooled estimate and pooled estimates are usually about the same size in this case
the potential problems caused by choosing the pooled when the unpooled is correct choice can be very serious, leading to false conclusions
So we will only work with the unpooled SE (the pooled SE formula is in the book)
SE(x̄1 - x̄2) = sqrt(SE1² + SE2²)
or
SE(x̄1 - x̄2) = sqrt(s1²/n1 + s2²/n2)
This second form is the same as the first if you substitute for SE using the definition of SE from the previous chapter (SE = s/sqrt(n)).
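As a quick sketch (the helper function is mine, not the book's), the unpooled SE can be computed directly from the two sample summaries:

```python
import math

def unpooled_se(s1, n1, s2, n2):
    """Unpooled standard error of the difference between means:
    sqrt(s1^2/n1 + s2^2/n2)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical sample summaries (not from the text)
se = unpooled_se(15.0, 10, 12.0, 12)
```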
Confidence Interval for the Difference Between Means
Once you have calculated the standard error of the difference, this is just an adaptation of the CI formula from the previous chapter:
(x̄1 - x̄2) ± t0.025 · SE(x̄1 - x̄2)
Notice that I have chosen 95% as the confidence level. It can be something else, but then the t-value would have to be adjusted
Notice also that the CI is for the difference between parameters (population means) and is calculated from sample statistics
In order to look up the right t-value, you need to know the degrees of freedom, which requires a bit of calculating:
df = (SE1² + SE2²)² / [SE1⁴/(n1 - 1) + SE2⁴/(n2 - 1)]
SE1 and SE2 are the standard errors of the two samples (the sample standard deviation divided by the square root of the sample size)
This formula, by the way, is one of the few differences between the previous edition of the book and this one. I was taught to use df = n1+n2-2, but this has been found to be too error prone, and the new formula has been substituted
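A sketch of the whole CI calculation, combining the unpooled SE, the new degrees-of-freedom formula, and SciPy's t quantile function (the function names are mine):

```python
import math
from scipy import stats

def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the unpooled SE (the formula above)."""
    se1_sq = s1**2 / n1
    se2_sq = s2**2 / n2
    return (se1_sq + se2_sq)**2 / (se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1))

def ci_difference(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Confidence interval for mu1 - mu2 using the unpooled SE."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    df = welch_df(s1, n1, s2, n2)
    t = stats.t.ppf(1 - (1 - conf) / 2, df)  # upper-tail t-value
    diff = xbar1 - xbar2
    return diff - t * se, diff + t * se
```

Note that when the sample standard deviations and sizes are equal, the formula gives df = n1 + n2 - 2, the value the older rule would have used.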
This test is valid when:
each sample is a random sample
the populations are normally distributed if the samples are small (this assumption is relaxed for large samples)
Hypothesis Testing with the t-Test
What if you wanted to compare two means, say a control and an experimental sample mean in order to find out whether or not they were different?
You could calculate the CI for the difference between control and experimental means
If the CI included 0, then 0 is a plausible value for the difference, so you could not conclude that the means differ (this is not the same as being confident that they are equal)
If the CI did not include 0, then you might conclude that you are 95% confident that there is a difference between the control and experimental means.
There is another, more formal way of doing this called HYPOTHESIS TESTING
The case in which there is no difference between means is called the NULL HYPOTHESIS and it is written:
H0: μ1 = μ2
The case in which there is a difference between means is called the ALTERNATIVE HYPOTHESIS and it is written:
Ha: μ1 ≠ μ2
Note that the experimental can be larger or smaller than the control unless we specify otherwise, as we do below.
You test whether or not to accept the null hypothesis.
If you accept the null, you automatically reject the alternative.
If you reject the null, you automatically accept the alternative.
The test is done by calculating a t-STATISTIC.
This is a measure of how many standard errors apart the two means are, analogous to the calculation of a z-value (remember, analogous, not equal).
The t-statistic is used to judge whether or not such a large difference between the means is expected if the null hypothesis is true.
This judgement is done by comparing ts with a t-value (briefly called just t) that represents a maximum probability of making an error you are willing to accept
If |ts| > t, then we will reject H0 and conclude the experimental mean differs from the control mean
If |ts| < t, then we will accept H0 and conclude there is no difference between the experimental mean and the control mean
How do we know which is the correct t-value with which to compare ts? After all, a t value is associated with both a degrees of freedom and with a probability
We will use the formula above for the degrees of freedom
We call the probability a p-VALUE.
The p-value is the probability that you would get such a large ts value IF THE NULL HYPOTHESIS IS TRUE
When we look up t, we are finding the value that ts would exceed with some small probability, say 5% or 1%, if the null is true.
So, if our ts exceeds this value, then getting a ts that large is even less likely if the null is true (remember, if the null is true, the real difference should be zero).
Some things to note.
There is no reason to use the same p-value for all tests. If you want to be conservative and only reject when the difference between the means is really large, use a p-value of 0.01 or 0.001 instead of 0.05
The p-value makes no distinction between the case in which the experimental mean is larger and the case in which the control mean is larger.
The t table has only values for the upper tail, so you have to use the column with ONE HALF OF THE P-VALUE: if the p-value is 0.05, then you use the 0.025 column (using the 0.05 column would correspond to a p-value of 0.10).
This lack of distinction between larger or smaller may not be useful
If you are looking for an improvement of the 1-year survival rate for a cancer drug, you would not want to test for the case that the drug shortened the life span, only the case that it increased the life span.
We will discuss cases when only one possibility is of interest in a section below.
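SciPy performs the whole unpooled (Welch) test in one call; equal_var=False selects the unpooled SE and the degrees-of-freedom formula above. The data values here are made up for illustration:

```python
from scipy import stats

# Hypothetical control and experimental measurements (illustrative only)
control = [9.1, 10.2, 8.8, 10.5, 9.9, 10.1, 9.5, 10.7]
experimental = [12.0, 11.4, 13.1, 12.6, 11.9, 12.3, 12.8, 11.7]

# equal_var=False gives the unpooled (Welch) t-test described above
t_stat, p_value = stats.ttest_ind(experimental, control, equal_var=False)
if p_value < 0.05:
    print("reject H0: the means differ")
else:
    print("accept H0: no demonstrated difference")
```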
Conditions for Validity of the t-test
These are essentially the same as for a confidence interval.
Each sample must be:
from an independent population
randomly chosen
much smaller than the population from which it is drawn
Each population must be:
normally distributed if the sample size is small
this is relaxed if the sample size is large (see the book on the central limit theorem to find out why)
Significance level, α (alpha)
When doing a t-test, you calculate a ts value and compare it to some t-value taken from the t-table.
α is the critical error rate you don't wish to exceed, and it tells you which t-value to compare to ts, once the degrees of freedom are known.
α should be chosen before analyzing the data (it is not fair to choose a level that lets you conclude what you want after calculating your ts).
By choosing an α-value, you then know where to go in the t-table to get the t-value to compare to ts
Above, the idea of error was introduced. This is not the error we mean by random error, but an error that lies in drawing a wrong conclusion.
Thus, if we choose an α-value of 0.05, then we are saying that we are willing to accept a 5% chance of rejecting H0 (and accepting Ha) when H0 is actually true
There is another type of error that can be made, and the table below makes the distinction between the two.
                 | H0 is true       | H0 is false
You accept H0    | correct decision | Type II error
You reject H0    | Type I error     | correct decision
The t-test allows you to choose the TYPE I ERROR RATE
An example of the difference between the two types of error
Two new (fictitious) home tests for prostate cancer are submitted to the FDA for approval to sell them over the counter. Formulation A almost never misses the presence of the cancer but 80% of the people who test positive really don't have the cancer. Formulation B has a much better accuracy in that only 5% of those who test positive are false positives. However, 5% of the time, the second formulation fails to detect cancer in patients with cancer. Which do you approve if you work for the FDA?
If you consider not having cancer as the null hypothesis and having cancer the alternative, then we can assign the two cases error types.
If the patient does not have cancer, then H0 is true and a positive test for cancer means rejecting the (true) null hypothesis and accepting the (false) alternative, Ha. So Formulation A makes type I errors.
If the patient has cancer, H0 is then false. When the test results are negative you are accepting H0, even though it is false, therefore rejecting the (true) alternative. This is a type II error. Formulation B makes type II errors.
Which should you, the poor FDA employee, do? In this case, Type II errors lead to undetected cases of cancer. Type I error, since it is so common, might cause a panic of false positives.
Not sure what to do? Neither am I. Statistics will not solve all your problems.
The One-Tailed t-Test
When you are not interested in the possibility that mean A is smaller than mean B, only in whether it is larger, then you want to use a ONE-TAILED t-TEST.
You first modify the alternative hypothesis.
The null hypothesis is unchanged:
H0: μ1 = μ2
The alternative is written one of two ways, depending on which possibility is of interest:
Ha: μ1 > μ2 or Ha: μ1 < μ2
Once you decide this (and YOU MUST CHOOSE THE APPROPRIATE ALTERNATIVE HYPOTHESIS BEFORE PERFORMING ANY ANALYSIS OF THE DATA) you need to alter the t-value you use.
Before, the area under the curve that represents the probability of making an error was found in both tails (to cover error in either direction)
Now, the error of interest is only in one direction (depending on Ha), so all of the area under the curve will be on the appropriate side
So, before, to get a 5% error rate, you used the 0.025 column, because the column labels refer to only the upper tail.
Now, to get a 5% error rate, you use the 0.05 column.
Remember, that if you choose the second Ha, your difference between the means is expected to be negative, and you must put a negative in front of the t-value because you want the lower tail, not the upper, and t-values on that side of the mean are negative.
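With SciPy (version 1.6 or later), the one-tailed version only requires the alternative argument, which takes care of the tail bookkeeping for you; the survival-time numbers below are hypothetical:

```python
from scipy import stats

# Hypothetical survival times (months); we only care whether the drug INCREASES them
placebo = [34, 29, 31, 36, 33, 30, 35, 32]
drug = [38, 41, 36, 44, 39, 37, 42, 40]

# alternative="greater" tests Ha: mean(drug) > mean(placebo)
# (requires SciPy >= 1.6; equal_var=False keeps the unpooled SE)
t_stat, p_one_sided = stats.ttest_ind(drug, placebo, equal_var=False,
                                      alternative="greater")
```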
Significance and Effect Size
After doing a t-test that rejected H0, what do you conclude?
Often, the conclusion is that there is a SIGNIFICANT DIFFERENCE between the means.
This only means that there is a statistically demonstrated difference, not that the difference is necessarily important or useful.
If you have huge sample sizes, then the standard error can be very small compared to the standard deviation.
Look at it this way. There are two different cancer treatments.
The average number of months survival post-treatment for treatment A is 79 months with a standard deviation of 15 months.
The average months of survival for treatment B is 80.5 months with a standard deviation of 12 months.
The clinical trial that produced these statistics was done with a sample of 10,000 patients for each treatment. Therefore, the standard errors of A and B are 0.15 and 0.12, respectively.
A t-test shows a significant difference between the treatments. Do you consider the significant difference important? What if treatment A had no side effects and treatment B caused loss of hair? What if treatment A cost a tenth of treatment B?
Importance makes reference to the context in which the data were collected, statistical significance only refers to the outcome of a statistical test.
One way of assessing importance is to calculate and report EFFECT SIZE.
This is simply the difference between the means divided by the larger of the two sample standard deviations.
In the case above, effect size is 1.5 months/15 months = 0.1, so the difference between the two is a small fraction of the dispersion of the data
A second way is to calculate the confidence interval of the difference between the means.
With the confidence interval, you may be able to judge the importance of the difference.
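As a sketch (the helper function is mine), the effect size for the cancer-treatment example above works out like this:

```python
def effect_size(xbar1, xbar2, s1, s2):
    """Difference between means divided by the larger of the two
    sample standard deviations (the definition used in the text)."""
    return abs(xbar1 - xbar2) / max(s1, s2)

# The cancer-treatment example: means 80.5 and 79 months, SDs 12 and 15
es = effect_size(80.5, 79.0, 12.0, 15.0)  # 1.5 / 15 = 0.1
```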
Planning for Adequate Power
When we pick an α-value, we are picking the chance that we will reject H0 when it is true, a type I error.
This means we are minimizing the probability of reporting a difference between population means when none actually exists.
We have seen that a second error type exists: the error of accepting H0 when it is false, a type II error.
This is the error of reporting no difference between means when one actually exists.
The ability of a test to reject H0 when it is false is called the POWER of the test.
Given that we are comparing two normally distributed independent populations with equal standard deviations and we are doing the comparisons by drawing random samples of equal size, then we can consider the factors that influence the power of a test.
α-value
There is an inverse relationship between α and the probability of making a type II error.
If you choose to lessen the type I error rate by using a small α, it comes at the expense of increasing the probability of making a type II error.
If you reduce your chance of accepting a false H0, then you increase the chance of rejecting a true H0.
Population standard deviation
Larger population standard deviations mean that the sample standard deviations are expected to be larger, and so will the standard errors of the mean.
Larger standard errors of the mean lead to smaller t-statistics (the standard error is the t-statistic's denominator) and less chance that you will reject H0, and, thus, a greater chance of a type II error (= less power).
Difference in means
Smaller differences between sample means reduce the power of a test.
Remember that the t-statistic is a ratio of the difference between the means to the standard error.
If you decrease the size of the numerator, the ratio will decrease in size, thus making it harder to reject H0 (= less power)
Sample size
We have seen that large standard deviations reduce power because they increase the size of the standard error.
Standard errors also depend on sample size but, because sample size is in the denominator, larger sample sizes will decrease standard errors and increase the power of the test.
If you look at these four factors, you will see that the only one we exert control over is the sample size.
(Assuming that we are being as careful as possible when doing the sampling to minimize error introduced during the experiment.)
Planning for power means choosing a sample size that will produce an acceptable chance of a type II error.
To plan you have to:
- Choose an α.
- Know enough to make a reasonable guess about the population standard deviations.
- Make an estimate of the effect size (simplified by the assumption of equal standard deviations for the populations).
With these three numbers, you can look up a recommended sample size in Table 5 in the back of the book.
Note that the predicted trends are there in the table.
As α goes down, larger sample sizes are needed.
As effect size goes up, smaller sample sizes are needed.
Also, as power goes up, larger sample sizes are needed.
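Table 5 does this lookup for you; as a rough stand-in, the common normal-approximation formula (my substitution, not the book's table) shows the same trends:

```python
import math
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample t-test,
    using the standard normal approximation: n = 2 (z_alpha + z_power)^2 / d^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # upper-tail z for two-sided alpha
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_power)**2 / effect_size**2)
```

For example, detecting an effect size of 0.5 with α = 0.05 and 80% power takes roughly 63 subjects per group by this approximation, and a larger effect size or a larger α brings that number down, just as the table predicts.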
Alternative Methods: the Wilcoxon-Mann-Whitney Test
This test is often used when either the assumptions of the t-test are not met or when it is impossible to determine if the assumptions have been satisfied
It is NONPARAMETRIC
It tests for a difference between the samples but not for a difference in a specific parameter (the t-test is for a difference in the sample means)
It is DISTRIBUTION-FREE
No assumptions are made about the shape of the distribution of the population or sample.
The only assumptions are that the samples be randomly drawn from independent populations.
The test looks for a difference between the distributions from two samples.
It does this by determining the probability of getting more of the small observations in one sample than in the other.
Because only the ranks of the observations are used, and not their absolute sizes, we say that this test does not use all of the information in a sample.
This may mean that it is less able to detect differences between populations (= reject H0) than a parametric test like the t-test, especially when sample sizes are small (see below).
H0: There is no difference between the distributions of the two populations from which the samples have been drawn
The alternative may be either directional or nondirectional:
non-directional Ha: The distributions of the two populations from which the samples have been drawn are different
directional Ha: The members of population A tend to have larger values than those in population B
The test works by measuring overlap between the size of sample observations.
the statistic that measures this is Us
Method of calculating Us
- Order each sample from smallest to largest
- Determine K1 and K2
- For each observation in sample 1, count the number of observations in sample 2 that are smaller. Tied observations count as 1/2. Sum the counts to get K1
- Do the same for the observations in sample 2 to get K2
- Check to see that there are no errors by adding K1 and K2. Their total should equal the product of the two sample sizes. If not, an error has been made
- Us is simply the larger of the two K values
Critical values of Us can be looked up in a table at the back of the book (this distribution does not appear in the MS Excel function list).
Because the K values are discrete, the probability distribution of Us is not a continuous curve, like the normal, but a histogram, like the binomial.
This means that not all probabilities are possible.
The probabilities reported across the top of the table are limits and the K values below are the largest K value with a probability less than (or, rarely, equal to) the probability listed at the top of the table
The discrete nature of the distribution of Us also means that, when the sample sizes are small, there may be no K value with a probability small enough to use a small critical value (say, 0.01 or so).
For example, if the probability of the largest K is 0.15, then you will not be able to reject H0 with an α-value of 0.01
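The counting method above can be sketched as follows (a naive implementation of the K counts; SciPy's scipy.stats.mannwhitneyu provides a tested version of the test itself):

```python
def mann_whitney_us(sample1, sample2):
    """Us by the counting method in the text: for each observation in one
    sample, count the observations in the other that are smaller
    (ties count as 1/2); Us is the larger of K1 and K2."""
    def k(a, b):
        total = 0.0
        for x in a:
            for y in b:
                if y < x:
                    total += 1
                elif y == x:
                    total += 0.5
        return total
    k1, k2 = k(sample1, sample2), k(sample2, sample1)
    # The text's arithmetic check: K1 + K2 must equal n1 * n2
    assert k1 + k2 == len(sample1) * len(sample2)
    return max(k1, k2)
```

For example, with samples [1, 2, 3] and [4, 5], every observation in the second sample exceeds every observation in the first, so K2 = n1 · n2 = 6 and Us = 6.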
Conditions for Validity of the Wilcoxon-Mann-Whitney Test
Each sample must be:
Last updated September 12, 2006