BIOL 3110
Biostatistics, Phil Ganter, 320 Harned Hall, 963-5782
Confidence Intervals
Unit Organization:
Problems:
Problems for homework (assume a 6. in front of each)
- 3, 7, 10, 25, 29, 32, 34, 36, 52, 58
Suggested Problems (assume a 6. in front of each)
- 1, 2, 12, 14, 16, 18, 27, 31, 39, 43, any from the last section of problems
Estimating population values from samples is the reason for doing statistics in the first place.
Statistical estimation has two components: the estimate itself and a statement of how reliable that estimate is.
In this lecture, we will learn how to construct confidence intervals, which depend on our knowing two things:
As discussed before, the best estimate of the true mean is a sample mean (or, better, a mean of sample means)
this leaves us with the problem of estimating the true standard deviation (it's not exactly s)
First, let's recall what the standard deviation is:
The standard deviation of the sample is related to but not equal to the standard deviation of the population.
The standard deviation of the sample is larger because a sample is an estimate and, to be conservative about what we think we know, we divide by n-1 rather than n.
The standard deviation of the means of samples drawn from a population is likely to be smaller than the standard deviation estimated from a single sample, and even smaller than the true population standard deviation
When means are calculated from samples drawn randomly from a population, they will most often be closer to the true mean than will a single data point drawn from the population at random
Thus, a standard deviation calculated from 10 means will be smaller than a standard deviation calculated from 10 values drawn at random from the population
To calculate the standard error of means, we need a formula that guarantees it is smaller than the population standard deviation (σ)
There is a second consideration: samples are not all equally good
Means calculated from large samples are more likely to be near the true mean (μ) than those calculated from small samples.
If this is so, then a standard deviation calculated from the means of small samples (which are clustered less tightly about the true mean) should be larger than a standard deviation calculated from the means of large samples (which are clustered more tightly about the true mean)
So, taking both considerations into account, we use this equation:
standard deviation of the sample means = σ / sqrt(n)
Note that we use the square root of the sample size (remember that σ is itself a square root, the square root of the variance)
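As a quick illustration (not from the lecture), the short Python simulation below draws many samples from a made-up normal population and shows that the standard deviation of the sample means tracks σ/sqrt(n); the population values (μ = 50, σ = 10) and the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 50, 10          # made-up population mean and standard deviation

    for n in (5, 25, 100):
        # draw 10,000 samples of size n and compute the mean of each sample
        means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
        print(f"n = {n:3d}:  SD of sample means = {means.std(ddof=1):.3f},"
              f"  sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")

The standard deviation of the means shrinks as n grows, which is exactly what dividing by sqrt(n) predicts.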
A practical problem is embedded in the above definition of the standard error of sample means.
To calculate the standard deviation of the means, we needed the standard deviation of the population
This is not usually something we know, so we need a fix
So, what can we use to estimate sigma (σ)?
If we have something measurable with which to estimate σ, then we can use it to find the probability of a sample mean being close to the actual mean (μ)
The most obvious estimate, and the one we will use, is s, the standard deviation of the sample, which we substitute into the formula for the standard deviation of the sample means
We call it the standard error (SE) of the mean, not the standard deviation of the mean, to distinguish a quantity describing the spread of a statistic (the sample mean) from one describing the spread of the data themselves.
Difference between SE and SD
Standard deviation of the sample - refers to how the individual data points are distributed with respect to the mean; it is a measure of data dispersion.
Remember how it is computed (as the square root of the average [corrected for degrees of freedom] of the squared deviations of the data points from the mean [remember we square as a means of making all of the deviations positive])
Standard Error of the mean - refers to how much sample means, not individual data points, tend to differ from the true population mean
The book states the same thing in different words: it defines the SE as a measure of the uncertainty, due to sampling (random) error, in how good a sample mean is as an estimate of the true mean
A larger SE means there is more uncertainty in using the sample mean as an estimator of the true (population) mean
Remember that SE is related to s, but that it is always smaller than s
The sqrt(n) divisor means that larger samples have a smaller SE; that is, their means are expected to be closer to the true mean than are the means of smaller samples
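A minimal sketch of the distinction, using a handful of invented measurements: the sample standard deviation s describes the spread of the data, while the standard error s/sqrt(n) describes the uncertainty in the mean.

    import numpy as np

    x = np.array([23.1, 26.4, 24.8, 22.0, 27.3, 25.5])   # invented measurements
    n = len(x)

    s = x.std(ddof=1)        # sample standard deviation (divide by n - 1)
    se = s / np.sqrt(n)      # standard error of the mean

    print(f"mean = {x.mean():.2f}   s = {s:.2f}   SE = {se:.2f}")   # SE is smaller than s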
Confidence Interval for the True Mean
A confidence interval is a range of values between which we believe the value of interest to lie
for us, μ, the mean of the population, is the value of interest
The size of the range depends on two things
how sure we want to be
if we want to be more sure, then we must have a larger range
you can be somewhat sure that the true mean of the student age at TSU is between 20 and 30
you can be totally sure that the true mean of the student age at TSU is between 1 and 100
how much variation there is in the original population
If we knew σ, the population standard deviation, we could get the range for x̄, the sample mean, given that the population is normally distributed
if you want to know how big a confidence interval must be for you to be 75% sure that it contains the mean, you have to find the z values that have 75% of the area under the normal curve between them (see below to see that these values are about -1.15 and 1.15)
Then one would find X, the actual numbers, from the z values (remember how you calculate a z), as done below
The confidence interval when sigma is known:
Start with the proposition that the probability of Z being between -1.15 and 1.15 is 75%
Pr{ -1.15 < Z < 1.15 } = 0.75
Substitute the formula for calculating Z based on the sample mean (because it has μ and σ in it, and we want to know how one relates to the other)
Pr{ -1.15 < (x̄ - μ) / (σ/sqrt(n)) < 1.15 } = 0.75
Now do some simple algebra to find out where μ should lie
first, multiply all three terms by the denominator, σ/sqrt(n), to eliminate it from the middle term
Pr{ -1.15*σ/sqrt(n) < x̄ - μ < 1.15*σ/sqrt(n) } = 0.75
then subtract x̄ from all 3 terms and multiply by -1 to turn the negative into a positive (with the appropriate changes in the direction of the inequality signs) and you get
Pr{ x̄ - 1.15*σ/sqrt(n) < μ < x̄ + 1.15*σ/sqrt(n) } = 0.75
This last expression is a confidence interval.
It says that there is a 75% chance of the true mean being between the sample mean minus a term (based on a z value and the standard error of sample means) and the sample mean plus the same term.
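A quick numerical check of this result, assuming a made-up sample mean, a known σ, and a sample size of 10; scipy is used to find the z value that leaves 75% of the area in the middle of the normal curve.

    import numpy as np
    from scipy import stats

    xbar, sigma, n = 24.6, 7.0, 10     # hypothetical sample mean, known sigma, sample size

    z = stats.norm.ppf(0.875)          # 12.5% in each tail leaves 75% in the middle
    half_width = z * sigma / np.sqrt(n)

    print(f"z = {z:.3f}")              # about 1.15
    print(f"75% CI: {xbar - half_width:.2f} to {xbar + half_width:.2f}")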
We have done it: we have found out where the true mean lies with a confidence level of 75%. But there is a fly in the ointment, a bit of unfinished business
How can we find σ/sqrt(n) unless we know σ?
We need an estimator of σ, and we turn to the same fix as before and use s, which gives us the SE of the mean (discussed above) in place of σ/sqrt(n)
There is a second problem, because the normal distribution was derived using σ, not SE or s
It turns out that the distribution of means follows a curve called Student's t, which is similar to the normal
it has a larger standard deviation term than does the normal
the difference between the t distribution and the normal depends on the sample size: smaller sample sizes are less like the normal and larger sample sizes are more similar
when sample size is infinitely large, the t and the normal distributions are identical
Calculating a confidence interval using the t-distribution
We can re-write the last equation for a confidence interval now
Pr{ x̄ - t0.125*SEx ≤ μ ≤ x̄ + t0.125*SEx } = 0.75
so we need to calculate
x̄ ± t0.125*SEx
We know how to do the sample mean and SE (from above), but what is t?
t refers to the Student's-t distribution. It is a version of the normal distribution with a lower, flatter peak and heavier tails. There is more than one Student's-t distribution; in fact, each sample size produces a unique t-distribution.
The shape of the distribution changes with n, the sample size. As n gets larger, the Student's-t distribution becomes more and more similar to the normal (in fact, when n is infinitely large, they are the same).
In the figure below, the normal curve is in pink and is the curve with the highest peak that falls most steeply. Compare it with the Student's-t curve for 1 degree of freedom (k in the figure is the degrees of freedom, which depends on the sample size - see below), which is black. The x-axis units are standard deviations and the y-axis is probability density.
diagram from Wikipedia, Student's t Distribution entry, used here under GNU license
The peak of the student's-t distribution (black) is lower than the normal (pink) and the tails on either side for the student's t distribution are higher than for the normal distribution.
This means that, if I compare the areas under the curves that are less than -2 sd, then the area will be larger for the student-t distribution than for the normal.
Consider this: which curve has more area within ±1 sd of the mean (= 0 here)? Since there is more area out in the tails for the Student's-t distribution, there must be less in the center, so it is the normal that has more area (= greater probability) within a standard deviation of the mean.
This makes sense. A sample provides an estimate of the population. Estimates are not as accurate, and so a curve based on estimates should have more "spread." As the size of the sample increases, the estimate gets better and better, which happens here because the Student's-t distribution becomes more like the normal as n increases.
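The tail-area claims above can be checked numerically; the sketch below compares the normal curve with a Student's-t curve with 1 degree of freedom using scipy.

    from scipy import stats

    # area below -2 standard units: heavier left tail for the t distribution
    print("below -2:  normal =", round(stats.norm.cdf(-2), 4),
          "  t (1 df) =", round(stats.t.cdf(-2, df=1), 4))

    # area within +/- 1 standard unit: more of the normal's area is near the center
    norm_mid = stats.norm.cdf(1) - stats.norm.cdf(-1)
    t_mid = stats.t.cdf(1, df=1) - stats.t.cdf(-1, df=1)
    print("within 1:  normal =", round(norm_mid, 4), "  t (1 df) =", round(t_mid, 4))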
We can look up the needed values in Table 4, where the rows are degrees of freedom, the columns are upper-tail probabilities of the t-distribution, and the body of the table gives the corresponding t value
Degrees of freedom are n-1 when only one parameter is being estimated (we are trying to estimate only μ)
The critical value is the probability that the true mean will lie outside the confidence interval
for the example above, the critical value was 25% (= 1 - 75%)
The table lists only the upper tail, and we are concerned about x̄ being both too small and too large, so we need to divide the critical value in half to look it up
if you want a critical value of 5%, you have to look up 0.025, not 0.05
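Putting the pieces together, here is a small sketch (with invented data) that computes a t-based confidence interval; scipy's t distribution stands in for Table 4.

    import numpy as np
    from scipy import stats

    x = np.array([23.1, 26.4, 24.8, 22.0, 27.3, 25.5])    # invented sample
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    se = s / np.sqrt(n)

    conf = 0.95                                            # confidence level
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)     # the value Table 4 would give

    print(f"{conf:.0%} CI: {xbar - t_crit * se:.2f} to {xbar + t_crit * se:.2f}")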
How many observations do I need?
This is an important question to ask when designing experiments.
If you will be using statistics to evaluate the results, then you don't want
to have too few data points to show a difference between experimental and controls
to have more data points than is necessary to show a difference between experimental and controls
The first instance can be a disaster
the second may be an inconvenience (doing more than needs to be done) or can mean that you get fewer experiments done because you are wasting effort
We will use the formula for the standard error of the mean to estimate n (see above for this formula)
You can't do this without some sort of guessing, but the guessing should make use of prior knowledge and your expertise.
Consider what you want the ± portion of the confidence interval to be (see example below)
You are measuring the concentration of a protein, and you think that the experimental cells might have as much as 20% more than the control cells. You will have to set up a series of flasks in which to rear cells and will measure the concentration of the protein in each flask. How many flasks do you need to set up?
You know that the control cells produce (from the literature or from previous work) about 25 picograms per microliter of protein with a standard deviation of 7 picograms per microliter.
You want to be 95% confident that your estimate of the concentrations will be within 5 picograms per microliter of the true mean
The 95% is an arbitrary choice, but it means that you have only a 1 in 20 chance of being wrong
You chose the value of 5 because you expect the experimental cells to be only about 5 picograms per microliter above the controls (20% of 25).
So you want the ± portion of the confidence interval to be no larger than 5
the ± portion is t0.025*SEx,
t has the subscript of 0.025 because you want to be 95% confident, so the critical value is 5% (1-0.95) and this is divided in half because the table has only the upper tail probability (=area under the curve), and you are concerned about both the upper and the lower tails (missing by being too large or too small an estimate)
from the table, t0.025=~2 (you don't know the degrees of freedom yet, but look at table 4 in the 0.025 column, and the values drop to about 2 very quickly, so 2 is a reasonable estimate)
So, we can find n now because we have all the information we need
5 = t0.025*SEx,
5 = 2 * SEx, and SEx = S/sqrt(n)
5 = 2 * 7 /sqrt(n)
sqrt(n) = 14/5 = 2.8
n = 2.8^2 = 7.84, so you need about 8 flasks to make an estimate accurate enough for your purposes
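The same arithmetic as a short Python sketch; the numbers (s = 7, margin = 5, t ≈ 2) come from the example above.

    import math

    s = 7.0         # prior guess of the standard deviation (pg per uL)
    margin = 5.0    # desired half-width of the confidence interval (pg per uL)
    t_approx = 2.0  # rough t value for 95% confidence

    n = (t_approx * s / margin) ** 2
    print(f"n = {n:.2f}, round up to {math.ceil(n)} flasks")   # 7.84 -> 8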
Validity of Confidence Intervals
First, the SEx must be a valid estimate of σ/sqrt(n), the true standard error of the mean
one observation (x) must not influence the size of other observations (x's)
consider the removal of a sample and then not replacing the sample
you are measuring an enzyme in the gut of rats
you have 5 rats and you take out the intestine and cut it into six pieces. Enzyme concentration is measured in each piece of gut
How many observations do you have? You have 30 (5 rats times 6 pieces per rat)
How many independent observations do you have? You have only 5 because the six from the same rat might all be influenced by the individual characteristics of the rat, and you are not interested in that rat per se, but in all rats
The confidence interval is valid if the population from which the sample is drawn is normally distributed
this condition is strict if you want the CI to be exactly valid
this condition can be relaxed if:
the sample size is small and the population distribution is nearly normal
the sample size is large, in which case the population distribution is of no consequence (remember the central limit theorem)
Confidence Interval for a population proportion
The confidence interval for p needs an estimate of p, an estimate of the standard error of p, and an assumption about t; all three are discussed below.
The estimate of p is not simply the number of successes divided by the sample size; it includes a correction: add 2 successes and 2 failures, giving p̃ = (number of successes + 2) / (n + 4).
If n is large, the correction factor does not change the outcome much
The SE for p is based on the same idea as the standard error used before (see the book's discussion of using the normal to approximate the binomial)
The last piece is what to do about t: a binomial variable has only two outcomes (success and failure), so the data themselves cannot be normally distributed
The central limit theorem allows us to escape from this dilemma, so we estimate the t0.025 value at 1.96 (from the normal)
Confidence Interval for p
p̃ ± 1.96*SEp̃
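A sketch of the whole calculation, assuming the adjusted estimate p̃ = (y + 2)/(n + 4) and a standard error of sqrt(p̃(1 - p̃)/(n + 4)) (check these against the book's formulas); the counts used are invented.

    import math

    y, n = 18, 60                       # invented: 18 successes in 60 trials
    p_adj = (y + 2) / (n + 4)           # adjusted estimate of p
    se_p = math.sqrt(p_adj * (1 - p_adj) / (n + 4))

    lower, upper = p_adj - 1.96 * se_p, p_adj + 1.96 * se_p
    print(f"95% CI for p: {lower:.3f} to {upper:.3f}")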
Last updated September 7, 2006