BIOL 3110
Biostatistics | Phil Ganter | 302 Harned Hall | 963-5782
Comparing 2 Independent Samples
Unit Organization:
Problems:
Problems for homework (Assume a 7. in front of each)
- 4, 5, 10, 19, 23, 30, 42, 44, 47, 51, 57, 64, 79, 82, 89, 96, 97
Suggested Problems (Assume a 7. in front of each)
- 1, 2, 11, 17, 18, 24, 27, 38, 46, 50, 54, 60, 66, 68, 83, 104
Independent Samples and the Difference Between Means
We often want to compare two different populations.
Control versus Experimental
Male versus Female
Old versus New
We do this by drawing random samples from each population and comparing the samples.
Populations must not overlap, so that the samples drawn are INDEPENDENT of one another.
Any population parameters may be compared, but we will, once again, concentrate on the mean as the best way to compare populations (this is not always true).
To do this, we will speak about a composite statistic (or parameter), called the DIFFERENCE BETWEEN MEANS
For populations this is: μ1 - μ2
For samples this is: x̄1 - x̄2
where the subscripts 1 and 2 identify the different populations or samples
Standard Error of the Difference Between Means
We will use the standard error of the mean, SE(x̄), to get the standard error of the difference between means, SE(x̄1 - x̄2)
There are two ways to approach this.
One pools the variance of each sample to get an overall variance and then calculates the standard error from this pooled variance
The second is called the unpooled SE and it uses SE from both samples
If the 2 sample standard deviations are equal or if the sample sizes are equal, then the pooled and unpooled SEs are equal
When the sample sizes are unequal, then we must choose which to use
If the standard deviations of the POPULATIONS are EQUAL, then the pooled is the correct choice
However, the unpooled SE will be close to the pooled SE
If the standard deviations of the POPULATIONS are UNEQUAL, then the unpooled is the correct choice
The book recommends that the unpooled be the only choice, because:
the potential problems caused by choosing the unpooled when the pooled is the correct choice are small and the unpooled estimate and pooled estimates are usually about the same size in this case
the potential problems caused by choosing the pooled when the unpooled is correct choice can be very serious, leading to false conclusions
So we will only work with the unpooled SE (the pooled SE formula is in the book)
SE(x̄1 - x̄2) = sqrt(SE1² + SE2²)
or
SE(x̄1 - x̄2) = sqrt(s1²/n1 + s2²/n2)
This second form is the same as the first if you substitute for SE using the definition of SE from the previous chapter (SE = s/sqrt(n)).
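As a quick sketch (the helper function is mine, not the book's), the unpooled SE can be computed directly from the two sample summaries:

```python
import math

def unpooled_se(s1, n1, s2, n2):
    """Unpooled standard error of the difference between means:
    sqrt(s1^2/n1 + s2^2/n2)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical sample summaries (not from the text)
se = unpooled_se(15.0, 10, 12.0, 12)
```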
Confidence Interval for the Difference Between Means
Once you have calculated the standard error of the difference, this is just an adaptation of the CI formula from the previous chapter:
(x̄1 - x̄2) ± t0.025 · SE(x̄1 - x̄2)
Notice that I have chosen 95% as the confidence level. It can be something else, but then the t-value would have to be adjusted
Notice also that the CI is for the difference between parameters (population means) and is calculated from sample statistics
In order to look up the right t-value, you need to know the degrees of freedom, which requires a bit of calculating:
df = (SE1² + SE2²)² / [SE1⁴/(n1 - 1) + SE2⁴/(n2 - 1)]
SE1 and SE2 are the standard errors of the two samples (the sample standard deviation divided by the square root of the sample size)
This formula, by the way, is one of the few differences between the previous edition of the book and this one. I was taught to use df = n1+n2-2, but this has been found to be too error prone, and the new formula has been substituted
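A sketch of the whole CI calculation, combining the unpooled SE, the new degrees-of-freedom formula, and SciPy's t quantile function (the function names are mine):

```python
import math
from scipy import stats

def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the unpooled SE (the formula above)."""
    se1_sq = s1**2 / n1
    se2_sq = s2**2 / n2
    return (se1_sq + se2_sq)**2 / (se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1))

def ci_difference(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Confidence interval for mu1 - mu2 using the unpooled SE."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    df = welch_df(s1, n1, s2, n2)
    t = stats.t.ppf(1 - (1 - conf) / 2, df)  # upper-tail t-value
    diff = xbar1 - xbar2
    return diff - t * se, diff + t * se
```

Note that when the sample standard deviations and sizes are equal, the formula gives df = n1 + n2 - 2, the value the older rule would have used.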
This test is valid when:
each sample is a random sample
the populations are normally distributed if the samples are small (this assumption is relaxed for large samples)
Hypothesis Testing with the t-Test
What if you wanted to compare two means, say a control and an experimental sample mean in order to find out whether or not they were different?
You could calculate the CI for the difference between control and experimental means
If the CI included 0, then 0 is a plausible value for the difference, so you could not conclude that the means differ (this is not the same as being confident that they are equal)
If the CI did not include 0, then you might conclude that you are 95% confident that there is a difference between the control and experimental means.
There is another, more formal way of doing this called HYPOTHESIS TESTING
The case in which there is no difference between means is called the NULL HYPOTHESIS and it is written:
H0: μ1 = μ2
The case in which there is a difference between means is called the ALTERNATIVE HYPOTHESIS and it is written:
Ha: μ1 ≠ μ2
Note that the experimental can be larger or smaller than the control unless we specify otherwise, as we do below.
You test whether or not to accept the null hypothesis.
If you accept the null, you automatically reject the alternative.
If you reject the null, you automatically accept the alternative.
The test is done by calculating a t-STATISTIC.
This is a measure of how many standard errors apart the two means are, analogous to the calculation of a z-value (remember, analogous, not equal).
The t-statistic is used to judge whether or not such a large difference between the means is expected if the null hypothesis is true.
This judgement is done by comparing ts with a t-value (briefly called just t) that represents a maximum probability of making an error you are willing to accept
If |ts| > t, then we will reject H0 and conclude the experimental mean differs from the control mean
If |ts| < t, then we will accept H0 and conclude there is no difference between the experimental mean and the control mean
How do we know which is the correct t-value with which to compare ts? After all, a t value is associated with both a degrees of freedom and with a probability
We will use the formula above for the degrees of freedom
We call the probability a p-VALUE.
The p-value is the probability that you would get such a large ts value IF THE NULL HYPOTHESIS IS TRUE
When we look up t, we are finding the value that ts would exceed with some small probability, say 5% or 1%, if the null is true.
So, if our ts exceeds this value, then getting a ts that large is even less likely if the null is true (remember, if the null is true, the real difference should be zero).
Some things to note.
There is no reason to use the same p-value for all tests. If you want to be conservative and only reject when the difference between the means is really large, use a p-value of 0.01 or 0.001 instead of 0.05
The p-value makes no distinction between the case in which the experimental mean is larger and the case in which the control mean is larger.
The t table has only values for the upper tail, so you have to use the column with ONE HALF OF THE P-VALUE: if the p-value is 0.05, then you use the 0.025 column (using the 0.05 column would correspond to a p-value of 0.10).
This lack of distinction between larger or smaller may not be useful
If you are looking for an improvement of the 1-year survival rate for a cancer drug, you would not want to test for the case that the drug shortened the life span, only the case that it increased the life span.
We will discuss cases when only one possibility is of interest in a section below.
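SciPy performs the whole unpooled (Welch) test in one call; equal_var=False selects the unpooled SE and the degrees-of-freedom formula above. The data values here are made up for illustration:

```python
from scipy import stats

# Hypothetical control and experimental measurements (illustrative only)
control = [9.1, 10.2, 8.8, 10.5, 9.9, 10.1, 9.5, 10.7]
experimental = [12.0, 11.4, 13.1, 12.6, 11.9, 12.3, 12.8, 11.7]

# equal_var=False gives the unpooled (Welch) t-test described above
t_stat, p_value = stats.ttest_ind(experimental, control, equal_var=False)
if p_value < 0.05:
    print("reject H0: the means differ")
else:
    print("accept H0: no demonstrated difference")
```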
Conditions for Validity of the t-test
These are essentially the same as for a confidence interval.
Each sample must be:
from an independent population
randomly chosen
much smaller than the population from which it is drawn
Each population must be:
normally distributed if the sample size is small
this is relaxed if the sample size is large (see the book on the central limit theorem to find out why)
Significance level, α (alpha)
When doing a t-test, you calculate a ts value and compare it to some t-value taken from the t-table.
α is the critical error rate you don't wish to exceed, and it tells you which t-value to compare to ts, once the degrees of freedom are known.
α should be chosen before analyzing the data (it is not fair to choose a level that lets you conclude what you want after calculating your ts).
By choosing an α-value, you then know where to go in the t-table to get the t-value to compare to ts
Above, the idea of error was introduced. This is not the error we mean by random error, but an error that lies in drawing a wrong conclusion.
Thus, if we choose an α-value of 0.05, then we are saying that we are willing to accept a 5% chance of rejecting H0 (and accepting Ha) when H0 is actually true
There is another type of error that can be made, and the table below makes the distinction between the two.
                 | H0 is true       | H0 is false
You accept H0    | correct decision | Type II error
You reject H0    | Type I error     | correct decision
The t-test allows you to choose the TYPE I ERROR RATE
An example of the difference between the two types of error
Two new (fictitious) home tests for prostate cancer are submitted to the FDA for approval to sell them over the counter. Formulation A almost never misses the presence of the cancer but 80% of the people who test positive really don't have the cancer. Formulation B has a much better accuracy in that only 5% of those who test positive are false positives. However, 5% of the time, the second formulation fails to detect cancer in patients with cancer. Which do you approve if you work for the FDA?
If you consider not having cancer as the null hypothesis and having cancer the alternative, then we can assign the two cases error types.
If the patient does not have cancer, then H0 is true and a positive test for cancer means rejecting the (true) null hypothesis and accepting the (false) alternative, Ha. So Formulation A makes type I errors.
If the patient has cancer, H0 is then false. When the test results are negative you are accepting H0, even though it is false, therefore rejecting the (true) alternative. This is a type II error. Formulation B makes type II errors.
Which should you, the poor FDA employee, do? In this case, Type II errors lead to undetected cases of cancer. Type I error, since it is so common, might cause a panic of false positives.
Not sure what to do? Neither am I. Statistics will not solve all your problems.
The One-Tailed t-Test
When you are not interested in the possibility that mean A is smaller than mean B, only in whether it is larger, then you want to use a ONE-TAILED t-TEST.
You first modify the alternative hypothesis.
The null hypothesis is unchanged:
H0: μ1 = μ2
The alternative is written one of two ways, depending on which possibility is of interest:
Ha: μ1 > μ2 or Ha: μ1 < μ2
Once you decide this (and YOU MUST CHOOSE THE APPROPRIATE ALTERNATIVE HYPOTHESIS BEFORE PERFORMING ANY ANALYSIS OF THE DATA) you need to alter the t-value you use.
Before, the area under the curve that represents the probability of making an error was found in both tails (to cover error in either direction)
Now, the error of interest is only in one direction (depending on Ha), so all of the area under the curve will be on the appropriate side
So, before, to get a 5% error rate, you used the 0.025 column, because the column labels refer to only the upper tail.
Now, to get a 5% error rate, you use the 0.05 column.
Remember, that if you choose the second Ha, your difference between the means is expected to be negative, and you must put a negative in front of the t-value because you want the lower tail, not the upper, and t-values on that side of the mean are negative.
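With SciPy (version 1.6 or later), the one-tailed version only requires the alternative argument, which takes care of the tail bookkeeping for you; the survival-time numbers below are hypothetical:

```python
from scipy import stats

# Hypothetical survival times (months); we only care whether the drug INCREASES them
placebo = [34, 29, 31, 36, 33, 30, 35, 32]
drug = [38, 41, 36, 44, 39, 37, 42, 40]

# alternative="greater" tests Ha: mean(drug) > mean(placebo)
# (requires SciPy >= 1.6; equal_var=False keeps the unpooled SE)
t_stat, p_one_sided = stats.ttest_ind(drug, placebo, equal_var=False,
                                      alternative="greater")
```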
Significance and Effect Size
After doing a t-test that rejected H0, what do you conclude?
Often, the conclusion is that there is a SIGNIFICANT DIFFERENCE between the means.
This only means that there is a statistically demonstrated difference, not that the difference is necessarily important or useful.
If you have huge sample sizes, then the standard error can be very small compared to the standard deviation.
Look at it this way. There are two different cancer treatments.
The average number of months survival post-treatment for treatment A is 79 months with a standard deviation of 15 months.
The average months of survival for treatment B is 80.5 months with a standard deviation of 12 months.
The clinical trial that produced these statistics was done with a sample of 10,000 patients for each treatment. Therefore, the standard errors of A and B are 0.15 and 0.12, respectively.
A t-test shows a significant difference between the treatments. Do you consider the significant difference important? What if treatment A had no side effects and treatment B caused loss of hair? What if treatment A cost a tenth of treatment B?
Importance makes reference to the context in which the data were collected, statistical significance only refers to the outcome of a statistical test.
One way of assessing importance is to calculate and report EFFECT SIZE.
This is simply the difference between the means divided by the larger of the two sample standard deviations.
In the case above, effect size is 1.5 months/15 months = 0.1, so the difference between the two is a small fraction of the dispersion of the data
A second way is to calculate the confidence interval of the difference between the means.
With the confidence interval, you may be able to judge the importance of the difference.
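As a sketch (the helper function is mine), the effect size for the cancer-treatment example above works out like this:

```python
def effect_size(xbar1, xbar2, s1, s2):
    """Difference between means divided by the larger of the two
    sample standard deviations (the definition used in the text)."""
    return abs(xbar1 - xbar2) / max(s1, s2)

# The cancer-treatment example: means 80.5 and 79 months, SDs 12 and 15
es = effect_size(80.5, 79.0, 12.0, 15.0)  # 1.5 / 15 = 0.1
```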
Planning for Adequate Power
When we pick an α-value, we are picking the chance that we will reject H0 when it is true, a type I error.
This means we are minimizing the probability of reporting a difference between population means when none actually exists.
We have seen that a second error type exists: the error of accepting H0 when it is false, a type II error.
This is the error of reporting no difference between means when one actually exists.
The ability of a test to reject H0 when it is false is called the POWER of the test.
Given that we are comparing two normally distributed independent populations with equal standard deviations and we are doing the comparisons by drawing random samples of equal size, then we can consider the factors that influence the power of a test.
α-value
There is an inverse relationship between α and the probability of making a type II error.
If you choose to lessen the type I error rate by using a small α, it comes at the expense of increasing the probability of making a type II error.
If you reduce your chance of accepting a false H0, then you increase the chance of rejecting a true H0.
Population standard deviation
Larger population standard deviations mean that the sample standard deviations are expected to be larger, and so will the standard errors of the mean.
Larger standard errors of the mean lead to smaller t-statistics (the standard error is the t-statistic's denominator) and less chance that you will reject H0, and, thus, a greater chance of a type II error (= less power).
Difference in means
Smaller differences between sample means reduce the power of a test.
Remember that the t-statistic is a ratio of the difference between the means to the standard error.
If you decrease the size of the numerator, the ratio will decrease in size, thus making it harder to reject H0 (= less power)
Sample size
We have seen that large standard deviations reduce power because they increase the size of the standard error.
Standard errors also depend on sample size but, because sample size is in the denominator, larger sample sizes will decrease standard errors and increase the power of the test.
If you look at these four factors, you will see that the only one we exert control over is the sample size.
(Assuming that we are being as careful as possible when doing the sampling to minimize error introduced during the experiment.)
Planning for power means choosing a sample size that will produce an acceptable chance of a type II error.
To plan you have to:
- Choose an α.
- Know enough to make a reasonable guess about the population standard deviations.
- Make an estimate of the effect size (simplified by the assumption of equal standard deviations for the populations).
With these three numbers, you can look up a recommended sample size in Table 5 in the back of the book.
Note that the predicted trends are there in the table.
As α goes down, larger sample sizes are needed.
As effect size goes up, smaller sample sizes are needed.
Also, as power goes up, larger sample sizes are needed.
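Table 5 does this lookup for you; as a rough stand-in, the common normal-approximation formula (my substitution, not the book's table) shows the same trends:

```python
import math
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample t-test,
    using the standard normal approximation: n = 2 (z_alpha + z_power)^2 / d^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # upper-tail z for two-sided alpha
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_power)**2 / effect_size**2)
```

For example, detecting an effect size of 0.5 with α = 0.05 and 80% power takes roughly 63 subjects per group by this approximation, and a larger effect size or a larger α brings that number down, just as the table predicts.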
Alternative Methods: the Wilcoxon-Mann-Whitney Test
This test is often used when either the assumptions of the t-test are not met or when it is impossible to determine if the assumptions have been satisfied
It is NONPARAMETRIC
It tests for a difference between the samples but not for a difference in a specific parameter (the t-test is for a difference in the sample means)
It is DISTRIBUTION-FREE
No assumptions are made about the shape of the distribution of the population or sample.
The only assumptions are that the samples be randomly drawn from independent populations.
The test looks for a difference between the distributions from two samples.
It does this by determining the probability of getting more of the small observations in one sample than in the other.
Because only the ranks of the observations are used, and not their absolute sizes, we say that this test does not use all of the information in a sample.
This may mean that it is less able to detect differences between populations (= reject H0) than a parametric test like the t-test, especially when sample sizes are small (see below).
H0: There is no difference between the distributions of the two populations from which the samples have been drawn
The alternative may be either directional or nondirectional:
non-directional Ha: The distributions of the two populations from which the samples have been drawn are different
directional Ha: The members of population A tend to have larger values than those in population B
The test works by measuring overlap between the size of sample observations.
the statistic that measures this is Us
Method of calculating Us
- Order each sample from smallest to largest
- Determine K1 and K2
- For each observation in sample 1, count the number of observations in sample 2 that are smaller. Tied observations count as 1/2. Sum the counts to get K1
- Do the same for the observations in sample 2 to get K2
- Check to see that there are no errors by adding K1 and K2. Their total should equal the product of the two sample sizes. If not, an error has been made
- Us is simply the larger of the two K values
Critical values of Us can be looked up in a table at the back of the book (this distribution does not appear in the MS Excel function list).
Because the K values are discrete, the probability distribution of Us is not a continuous curve, like the normal, but a histogram, like the binomial.
This means that not all probabilities are possible.
The probabilities reported across the top of the table are limits and the K values below are the largest K value with a probability less than (or, rarely, equal to) the probability listed at the top of the table
The discrete nature of the distribution of Us also means that, when the sample sizes are small, there may be no K value with a probability small enough to use a small critical value (say, 0.01 or so).
For example, if the probability of the largest K is 0.15, then you will not be able to reject H0 with an α-value of 0.01
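The counting method above can be sketched as follows (a naive implementation of the K counts; SciPy's scipy.stats.mannwhitneyu provides a tested version of the test itself):

```python
def mann_whitney_us(sample1, sample2):
    """Us by the counting method in the text: for each observation in one
    sample, count the observations in the other that are smaller
    (ties count as 1/2); Us is the larger of K1 and K2."""
    def k(a, b):
        total = 0.0
        for x in a:
            for y in b:
                if y < x:
                    total += 1
                elif y == x:
                    total += 0.5
        return total
    k1, k2 = k(sample1, sample2), k(sample2, sample1)
    # The text's arithmetic check: K1 + K2 must equal n1 * n2
    assert k1 + k2 == len(sample1) * len(sample2)
    return max(k1, k2)
```

For example, with samples [1, 2, 3] and [4, 5], every observation in the second sample exceeds every observation in the first, so K2 = n1 · n2 = 6 and Us = 6.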
Conditions for Validity of the Wilcoxon-Mann-Whitney Test
Each sample must be:
Last updated September 12, 2006