BIOL 3110

Biostatistics

Phil Ganter

302 Harned Hall

963-5782

Sampling Distributions


Unit Organization:

Problems:

Problems for homework

  • 5.4, 5.5, 5.18, 5.20, 5.24, 5.33, 5.34, 5.35, 5.43

Suggested Problems

  • 5.1 and 5.2, 5.6, 5.12, 5.38, 5.45, 5.50, 5.56

 

Samples and Sample Variability

Sampling Variation

Variation among samples of observations drawn from a single population

samples differ from one another and from the true population value due to random chance (if they differ for any reason other than chance, then the sample is a BIASED sample)

Sampling Distribution

Probability of each of the outcomes possible for a sample taken from a population

since all possible outcomes are included, the sum of their probabilities (= total area under curve) must be equal to 1.00

Sampling Distribution from Dichotomous Data

An example of a sampling distribution - The book goes over why the binomial distribution is the sampling distribution for DICHOTOMOUS observations (something is or is not) for n observations

NOTICE that the x-axis in the book is not what you would expect.

Before, the x-axis was J, the number of successes, and the range was from 0 successes to n successes

Look at Figure 5.4 in the book.

The y-axis is still the frequency (= probability) of each x value.

The x-axis is not J, but p, the rate of success.

How did one scale get substituted for the other?

One simply has to divide each J by n, and the axis is re-scaled to run from 0 to 1

Why do this?

Because we want to use the binomial to predict the probability that the n trials give you a particular rate of success, not a particular number of successes, as we did in chapter 3.

The success rate, p, is π from the binomial
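Here is a minimal sketch of the re-scaling in Python, assuming the book's example values of n = 20 trials and a true success probability of π = 0.3 (the example discussed with Figure 5.4 below); the printing cutoff is arbitrary:

    from math import comb

    n, pi = 20, 0.3   # the book's example values (assumed here)

    # Binomial probability of each possible number of successes j
    probs = [comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(n + 1)]

    # All possible outcomes are included, so the probabilities sum to 1.00
    print(f"total probability = {sum(probs):.4f}")

    # Re-scale the x-axis: divide each j by n to get the success rate
    for j, prob in enumerate(probs):
        if prob >= 0.05:   # print only the middle of the distribution
            print(f"j = {j:2d}   j/n = {j / n:.2f}   Pr = {prob:.3f}")

The largest probability lands at 6/20 = 0.30, matching the Pr(6) = 0.192 value discussed below.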

Meta-experiments and the difference between p-hat and π

Meta-experiments

You do the same experiment over and over again (in the book it is a sample of 20 trials repeated over and over)

the meta-experiment is the sum of all of the individual experiments

Given that there is a true probability of some outcome (which is π), we designate the estimate of π as p-hat (p with a caret [^] over it)

p-hat is calculated from the results of the meta-experiment.

p is π (pi) from the binomial (for dichotomous data)

If p-hat is the estimate of π, then how close can we expect p-hat to be to π? How good is the estimate?

The sampling distribution can be used to estimate this.

This is why we re-scaled the x-axis above: the x-axis shows each possible p-hat, and the height above it gives the probability of getting that particular p-hat. Notice that the most probable outcome in Table 5.4 is J = 6

Pr(6) = 0.192

On the re-scaled x-axis in Figure 5.4, 6 becomes 0.3 (6/20, or J/n)

So, the most probable p-hat turns out to be π, the true probability of success

Using the sampling distribution to PREDICT how far p-hat should be from π is an instance of drawing a STATISTICAL INFERENCE.

Notice that statistical inferences never give you a yes-or-no answer, only the probability of a particular outcome
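As a sketch of such an inference in Python (same assumed example of n = 20 and π = 0.3): the sampling distribution gives the probability that p-hat lands within any chosen distance of π; the answer is a probability, not a yes or no.

    from math import comb

    n, pi = 20, 0.3   # assumed example values, as above

    def binom_pmf(j):
        # Binomial probability of exactly j successes in n trials
        return comb(n, j) * pi**j * (1 - pi)**(n - j)

    # p-hat within 0.1 of pi means 0.2 <= j/n <= 0.4, i.e. j = 4 through 8
    within = sum(binom_pmf(j) for j in range(4, 9))
    print(f"Pr(p-hat within 0.1 of pi) = {within:.3f}")   # about 0.78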

Sample Size

As the sample size (n in the binomial) increases, the sampling distribution of p-hat becomes narrower and narrower

The sample size here is the number of times you draw an observation from the population (it is n, the number of trials from the binomial, for the dichotomous case)

This means that, as the sample size increases, the chance that p-hat will be close to π gets greater and greater

If you are trying to estimate π, then a larger sample size will most likely give you a better estimate than will a small sample size.
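A short sketch of the narrowing (Python, again assuming π = 0.3): for each sample size, sum the binomial probabilities of every outcome whose p-hat lands within 0.05 of π.

    from math import comb

    pi = 0.3   # assumed true success probability

    def prob_within(n, tol=0.05):
        # Pr that p-hat = j/n falls within tol of pi
        # (the tiny epsilon guards against floating-point edge cases)
        return sum(comb(n, j) * pi**j * (1 - pi)**(n - j)
                   for j in range(n + 1)
                   if abs(j / n - pi) <= tol + 1e-12)

    for n in (20, 80, 320):
        print(f"n = {n:3d}   Pr(p-hat within 0.05 of pi) = {prob_within(n):.3f}")

The printed probability climbs toward 1 as n grows.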

Quantitative Observations

The Dichotomous example was simple because each trial was either a success or not a success

In the Dichotomous case, we asked how close to the predicted number of successes we should get in our sample; in other words, how close to the population proportion of successes (π) should our sample proportion of successes (p-hat) be?

Quantitative observations have real-number values associated with them so a sample of these observations will have a frequency distribution, mean, median, standard deviation, range, etc.

These are the summary parameters introduced in the first and second chapters

In the Quantitative case, we ask how close to the population parameter should our sample statistic be?

The most often asked question is how close to the population mean (μ) should our sample mean (x-bar) be.

We will concentrate on the mean for most of the rest of the semester, but you should realize that one can ask this question of any of the parameters.

The answer lies in the Sampling Distribution for that parameter, just as in the Dichotomous case

Our next step is to find out what that distribution is. It is not the binomial, as it was above. (Stop and take a guess here)

The Quantitative Meta-Experiment

As before, we have a population

the mean of the population is μ

the standard deviation of the population is σ

We take repeated samples of size n from the population and calculate the mean (x-bar) of each. This gives us a bunch of x-bars (one for each sample).

What is the expected mean and standard deviation of these x-bars?

The mean of the sample means should = μ. This can be derived most simply through a logical argument: if random chance is the only reason a sample mean differs from μ, then all of the x-bars should be clustered around μ, and when you take their average, the random errors should about cancel out.

The standard deviation of the sample means (their s.d.) is not so easy to derive. It is not just σ. Why is this? Well, in the original population, the values of x had some range, i.e., they varied from the mean. The means of the samples will have a range, too, and they will vary from the mean. The difference is that the variation among the sample means should be smaller than the variation among the x values. Each x-bar is calculated from a sample drawn from the population and will reflect a balance of some small and some large values. For this reason, the x-bars should usually be close to μ, the population mean. Compare this to the original data, the x values: they are spread throughout the original range. Now ask, "Should the standard deviation of the x-bars be larger or smaller than the standard deviation of the population (σ)?" Well, the x-bars cluster together, near μ, so they shouldn't vary as much as the elements of the population do. The x-bars should have a smaller standard deviation than the population (σ)

But how much smaller? Sample size is important once again. Larger samples should estimate μ more closely; that is, the x-bars from larger samples should be closer to μ than x-bars calculated from smaller samples. If large-sample x-bars are nearer one another (and μ), then their standard deviation should be smaller. So we need a way to reduce σ to get the standard deviation of the x-bars, and that way has to make a larger reduction for big n's (sample sizes) than for small n's.

Standard deviation of the x-bars = σ/sqrt(n)
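A quick Monte Carlo check of this formula in Python; the population used here (normal, μ = 100, σ = 15) and the sample size are assumed purely for illustration:

    import random
    import statistics

    random.seed(1)

    mu, sigma = 100, 15   # hypothetical population parameters
    n = 25                # sample size
    reps = 10_000         # number of samples in the meta-experiment

    # Draw repeated samples of size n and record each sample's mean
    xbars = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(reps)]

    print(f"mean of the x-bars = {statistics.mean(xbars):.2f}   (mu = {mu})")
    print(f"s.d. of the x-bars = {statistics.stdev(xbars):.2f}"
          f"   (sigma/sqrt(n) = {sigma / n**0.5:.2f})")

The standard deviation of the x-bars comes out near 15/sqrt(25) = 3, not 15.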

One last question. What should the frequency distribution of the x-bars look like? The original population might be normal or skewed or have some kurtosis. It might even be bimodal or multimodal. It might be a mess.

Central Limit Theorem

The central limit theorem answers the last question above. It turns out that the distribution of the population has an influence on the distribution of the x-bars only when the sample size is small.

As the sample size increases, the distribution of the x-bars becomes closer and closer to a normal distribution, no matter what the population's distribution is.

This is a very good thing for two reasons.

First, it lets us use the same methods for drawing inferences about sample means no matter how skewed or kurtotic the original population distributions are. This is simpler than needing a different methodology for each distribution

Second, we know lots of properties of the normal distribution and can make powerful inferences based on these properties

However, notice that this is only true for large sample sizes

We will see that this means that we need to approximate the normal distribution, and that the approximation should get closer to the normal as sample size goes up. We will do this in the next chapter.
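A sketch of the theorem in action (Python). The population below is exponential, which is strongly skewed and nothing like a normal curve; it is an assumed example chosen only because it is so far from normal. The skewness of the x-bars fades toward 0, the value for a symmetric (normal) curve, as n grows:

    import random
    import statistics

    random.seed(2)

    def skewness(data):
        # Moment-based skewness; 0 for a symmetric (e.g., normal) distribution
        m = statistics.mean(data)
        s = statistics.pstdev(data)
        return sum(((x - m) / s) ** 3 for x in data) / len(data)

    # Population: exponential with mean 1 (heavily right-skewed)
    for n in (1, 5, 30, 100):
        xbars = [statistics.mean(random.expovariate(1.0) for _ in range(n))
                 for _ in range(5_000)]
        print(f"n = {n:3d}   skewness of the x-bars = {skewness(xbars):+.2f}")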

 

So far we have three groups of numbers to keep track of. Here is a summary of the three (the book does this in Table 5.7)

  1. The population of x's, with a mean of μ and a standard deviation of σ
  2. The sample of size n drawn from the population, with n x's, a mean of x-bar, and a standard deviation of s
  3. The group of x-bars calculated from each sample drawn from the population. These x-bars have a mean of μ and a standard deviation of σ/sqrt(n)

 

Normal Approximation of the Binomial Distribution

Once again, the binomial has three quantities: Pr{a success for any particular trial} = π, j = the number of successes, and n = the sample size or number of trials

if n is large, then the binomial distribution can be approximated by a normal distribution with (notice that I use π in this section and the book uses p, but these are the same value - I want to be more consistent):

mean = nπ

standard deviation = sqrt[nπ(1-π)]

if n is large, then the distribution of the proportions of success (p-hat) in the samples can be approximated with a normal distribution with:

mean(p-hat) = π

standard deviation of p-hat = sqrt[π(1-π)/n]

Why is this useful? It's a time saver when n gets large

if n = 200, then what do you need to do to calculate the chance of getting more than 50 successes?

You have to calculate all of the binomial probabilities for 51 through 200 successes and sum these up.

or

You have to calculate all of the binomial probabilities for 0 through 50 successes, sum these up, and subtract the sum from 1

That's lots of work.

With the normal approximation you have to calculate the mean and standard deviation, z-ify 50, and look up the area associated with that z in the normal table and subtract that area from 1
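Here is the comparison as a Python sketch. The notes fix n = 200 but give no π, so π = 0.25 is assumed purely for illustration, and the normal table is replaced by a CDF built from the standard library's erf function:

    from math import comb, erf, sqrt

    n, pi = 200, 0.25   # n from the example above; pi is an assumed value

    # The long way: sum the binomial probabilities for 51 through 200 successes
    exact = sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(51, n + 1))

    # The short way: mean, standard deviation, z-ify 50, take the upper-tail area
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))

    def normal_cdf(x):
        # Area under the standard normal curve to the left of x
        return 0.5 * (1 + erf(x / sqrt(2)))

    z = (50 - mean) / sd
    approx = 1 - normal_cdf(z)

    print(f"exact binomial sum   = {exact:.4f}")
    print(f"normal approximation = {approx:.4f}")

The two answers are close; the continuity correction below closes most of the remaining gap.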

HOW large is large for n?

if π ~ 0.5, then n can be 10 or so

if π is close to 0 or 1.0, then:

nπ ~ 5

and

n(1-π) ~ 5

are reasonable guidelines (but are arbitrary).

Continuity Correction

Note that the binomial is discrete (and is represented by a histogram), while the normal is continuous (and is represented by a curve).

This means that a continuity correction is called for, especially when n is small

First find the histogram bar width

then add or subtract one half of this value to or from the x value you are working with

the addition or subtraction should be done to increase the final area under the normal curve
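Continuing the Python sketch from above (same assumed n = 200 and π = 0.25): the counts are whole numbers, so the histogram bars are 1 unit wide and the correction is one half. "More than 50 successes" means j ≥ 51, and the bar for j = 51 starts at 50.5, so the cutoff moves to 50.5 (taking in that whole bar, which increases the final area relative to cutting at 51):

    from math import comb, erf, sqrt

    n, pi = 200, 0.25   # same assumed example as above

    exact = sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(51, n + 1))

    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))

    def normal_cdf(x):
        return 0.5 * (1 + erf(x / sqrt(2)))

    # Bar width is 1, so shift the cutoff by half a bar: the bar for j = 51
    # runs from 50.5 to 51.5, and cutting at 50.5 keeps that whole bar's area
    plain     = 1 - normal_cdf((50.0 - mean) / sd)
    corrected = 1 - normal_cdf((50.5 - mean) / sd)

    print(f"exact binomial sum          = {exact:.4f}")
    print(f"no correction (cut at 50)   = {plain:.4f}")
    print(f"with correction (cut 50.5)  = {corrected:.4f}")

The corrected answer lands much closer to the exact binomial sum.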

Last updated February 14, 2006