# Between-individual Variation, Replication and Sampling

1- Between-individual variation

·        In biology, more than in physics and chemistry, variation is the rule and the causes of variation are many and diverse

·        Whether a source of variation is regarded as due to a factor of interest or to random variation depends on the particular question being asked

·        If we are interested in how one characteristic of our experimental subjects (variable A) is affected by two other characteristics (variables B and C), then A is called the dependent variable (response variable) and B and C are called independent variables (factors).

·        It is likely that A will be affected by more than just the two factors B and C that we are interested in. Any variation in the dependent variable between individuals in our sample that cannot be attributed to the independent factors is called random variation (between-individual variation or noise).

·        We are often interested in removing the effects of random variation

·        The following are the major tools to remove or reduce random variation

2- Replication

·        Replication involves making the same manipulations and taking the same measurements on a number of different experimental subjects (replicates).

·        Suppose we want to test the hypothesis: Gender has an effect on human height

·        To test our hypothesis we might look up the heights of two famous people, one male and one female, and conclude, for example, that males are taller than females. Of course this reasoning is not sound: the difference in height could have been due to random variation.

·        The solution is to sample more individuals so that we have replicate males and females. Suppose that we measure the height of 10 males and 10 females and find out that all males are taller than all females. Now we can be more confident that there really is a difference in height. If we found the same thing with 100 males and 100 females we would be even more confident.

·        What we have done is replicate our observation. If differences were due to chance (random variation) we would not expect the same trend to occur in a larger sample.

·        All statistics are based on replication, and are really just a way of formalizing the idea that the more times we observe a phenomenon the less likely it is to be occurring simply by chance.

·        The standard deviation is a measure of the spread of values around the mean (average) value. If the values follow a Gaussian (normal) distribution, then approximately 95% of the values lie within two standard deviations of the mean.

·        How spread out a normal distribution is depends on how much variation due to random factors there is, and this is measured by the standard deviation of the distribution.

·        Replication is a way of dealing with the between-individual variation (random variation) that will be present in any biology experiment. The more replicates we have, the greater the confidence we have that any difference we see between our experimental groups is due to the factors that we are interested in and not due to chance.
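The effect of replication can be illustrated with a small simulation. The sketch below is not part of the original notes: it assumes, purely for illustration, that heights are normally distributed with mean 170 cm and standard deviation 7 cm, and it draws both groups from the same population, so any difference between the group means is due to chance alone. The function name and all parameter values are hypothetical.

```python
import random
import statistics

def chance_difference_spread(n, sd=7.0, trials=1000, seed=1):
    """Spread (standard deviation) of the difference between two group means
    when both groups come from the SAME population, i.e. there is no real effect.

    The mean of 170 and the default sd of 7 are illustrative assumptions.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        group_a = [rng.gauss(170.0, sd) for _ in range(n)]
        group_b = [rng.gauss(170.0, sd) for _ in range(n)]
        diffs.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return statistics.stdev(diffs)

# Compare 1, 10 and 100 replicates per group.
for n in (1, 10, 100):
    print(n, "replicates per group -> typical chance difference:",
          round(chance_difference_spread(n), 2))
```

As the number of replicates per group grows, the differences produced by chance alone shrink, which is why a difference that persists in a large sample is less likely to be due to random variation.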

3- Pseudoreplication

·        The effect of replication in removing random variation relies on one critical rule: the replicate measures must be independent of each other. This means that any individual is just as likely to have a positive deviation from the norm due to random variation as it is to have a negative one. The deviations of different individuals then tend to cancel out, and the mean of the sample will be close to the mean of the population.

·        For example, suppose that instead of measuring the height of one man and one woman once, we measure each of them 10 times. Obviously this will not let us draw any conclusion about males and females in general. The 10 male measurements are not independent of each other; they were all made on the same man, so they are correlated.

·        Failure to have independent replicates is a very serious problem for an experimental design, as almost all statistical tests demand independence.
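The height example can be made concrete with a simulation. This is a minimal sketch under invented assumptions (population mean 175 cm, between-individual standard deviation 7 cm, measurement error 0.5 cm); none of these values or names come from the notes. It compares how well the population mean is estimated from 10 measurements of one man versus one measurement of each of 10 men.

```python
import random
import statistics

def estimate_spread(n_individuals, n_measures_each, trials=2000, seed=1):
    """Spread of the estimated population mean across many simulated studies.

    Hypothetical values: population mean 175 cm, between-individual
    sd 7 cm, measurement-error sd 0.5 cm.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        measurements = []
        for _ in range(n_individuals):
            true_height = rng.gauss(175.0, 7.0)   # between-individual variation
            for _ in range(n_measures_each):
                # repeated measures share the same true height, so they are correlated
                measurements.append(rng.gauss(true_height, 0.5))
        estimates.append(statistics.fmean(measurements))
    return statistics.stdev(estimates)

pseudoreplicated = estimate_spread(n_individuals=1, n_measures_each=10)
independent = estimate_spread(n_individuals=10, n_measures_each=1)
print("1 man measured 10 times:", round(pseudoreplicated, 2))
print("10 men measured once:   ", round(independent, 2))
```

Both designs yield 10 numbers, but only the independent design reduces the between-individual variation in the estimate; repeated measures of one man mostly reduce measurement error.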

1) Common sources of pseudoreplication

·        The use of multiple measurements of the same individual as if they were independent measurements is an extreme and obvious form of pseudoreplication

·        Shared enclosure. In experiments where two groups are placed in two enclosures each with a number of subjects, these are not independent samples because the effect may be due to differences in the enclosures. All factors must be the same in both enclosures for the samples to be independent.

·        Common environment. Similarly, in experiments where two groups are collected from two environments, all the factors in both environments must be the same for the samples to be independent

·        Relatedness. Similarity due to genetics means that relatives are not independent data points when looking at the effects of other treatments.

·        Pseudoreplicated stimulus. In experiments where two stimuli are used, if there are differences between the two stimuli other than the intended one then the data points are not independent.

·        Individuals are part of the environment too. If one individual can affect the others in shared enclosure or environment, then the data points are not independent

·        Pseudoreplication of measurements through time. This can occur in any experiment where we take multiple measurements through time. Whether the measurements are independent will depend critically on the biology of the system.

·        Species comparisons and pseudoreplication. In experiments where we select two or more species and carry out a different treatment on each species, the data points are not independent because the effect may be due to differences between the species and not due to the different treatments.

Whether measurements are pseudoreplicates will depend on the biology of the species that you are studying and the questions that you are asking. Thus, pseudoreplication is a problem that has to be addressed by biologists and not by statisticians.

4- Randomization

·        Randomization means that any individual experimental subject has the same chance of finding itself in each experimental group. It means drawing random samples for study from the wider population of all possible individuals that could be in your sample.

·        Randomization is important to avoid pseudoreplication problems and inadequate replication.

1)     Haphazard sampling

This is not the same as random sampling.

Example:

You have a tank full of 40 snails that you want to use in an experiment. The experiment requires that you allocate them to one of four different treatment groups. Random allocation means that you do the following:

a)     Each snail would be given a number from 1 to 40

b)    Pieces of paper numbered 1 to 40 are then placed in a hat

c)     Ten numbers are drawn blindly, and the snails with these numbers are allocated to treatment A

d)    Ten more are drawn and allocated to treatment B, and so on until all snails have been allocated.

Haphazard sampling would be as follows:

a)     Place your hand in the tank and take out whichever snail comes to hand

b)    Allocate the first ten snails to treatment A and so on until all snails have been allocated

There are many reasons why the first snails picked out could be systematically different from the last ones. Perhaps smaller snails have a better chance of avoiding your grasp.
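The random allocation steps (a) to (d) in the snail example can be sketched in code. This is one possible implementation, not a prescribed procedure: the function name `allocate_randomly`, the treatment labels and the seed are illustrative assumptions, and shuffling the list of numbers plays the role of drawing them blindly from a hat.

```python
import random

def allocate_randomly(subject_ids, treatments, seed=None):
    """Randomly split subjects into equally sized treatment groups."""
    if len(subject_ids) % len(treatments) != 0:
        raise ValueError("subjects cannot be divided equally among treatments")
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # plays the role of drawing numbers blindly from a hat
    group_size = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[i * group_size:(i + 1) * group_size])
        for i, treatment in enumerate(treatments)
    }

snails = list(range(1, 41))  # each snail is given a number from 1 to 40
groups = allocate_randomly(snails, ["A", "B", "C", "D"], seed=42)
for treatment, members in groups.items():
    print(treatment, members)
```

Because the allocation depends only on the shuffled numbers, properties such as size or how easy a snail is to catch cannot influence which treatment it receives. Fixing the seed is optional; it only makes the allocation reproducible.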

2)     Self-selection

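Phone polls are very poor indicators of what the wider public believes, because the people who respond have selected themselves and may differ systematically from those who do not. Be very careful when interpreting data that anyone has collected this way.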

·        Care must be taken to ensure that the random sample that we take is representative of the population that we wish to sample.

·        The need to randomize doesn’t just apply to the setting up of an experiment, but can equally apply to measuring. It is far better to organize your sampling procedure so that individuals are measured in a random order.

·        Your aim is to get a representative sample, not merely a random one. Always ask whether the sample you have drawn is unrepresentative. Decide immediately whether or not to reject a random sample, generally before you have looked at the data from that sample.

5- Selecting an appropriate number of replicates

·        A natural question that arises is how many replicates do we need?

·        The more replicates we have, the more confident we can be that differences between groups are real and not simply due to chance effects. However, increasing replication incurs costs: financial costs, time costs, and welfare or conservation costs.

·        The sample should be big enough to give you confidence that you will be able to detect biologically meaningful effects that exist, but not so big that some of the sampling was unnecessary.

·        The way to determine the appropriate number of replicates is to use the correct statistical methods.

Sample Size

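Sample size for estimating $\mu$

First decide how accurate the estimate needs to be. The interval has the form mean $\pm E$, so the width of the interval is $2E$.

For a $100(1 - \alpha)\%$ confidence interval for $\mu$ of the form mean $\pm E$:

$$n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2}$$

where $z_{\alpha/2}$ is the standard normal value with upper-tail area $\alpha/2$.

(Requires knowledge of the population variance $\sigma^2$; it can be obtained in one of two ways)

1- Use prior experiments to calculate the variance

2- Use the range ($\sigma \approx \text{range} / 4$)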

Example:

A biologist would like to estimate the effect of a drug on the growth of a particular bird species by examining the mean weight of the birds when a fixed amount of the drug is applied ($\sigma = 12$ g). How many observations are needed to estimate the mean weight of this bird species using a 95% confidence interval with a half-width of 3 g?

$$n = \frac{(1.96)^2 (12)^2}{(3)^2} \approx 61.5, \text{ which rounds up to } n = 62$$

Using a 99% confidence interval ($z_{\alpha/2} = 2.58$):

$$n = \frac{(2.58)^2 (12)^2}{(3)^2} \approx 106.5, \text{ which rounds up to } n = 107$$
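The calculation in this example can be checked with a few lines of code. The following is a minimal sketch, assuming the normal-based formula for estimating a mean with known standard deviation; the function name `sample_size_for_mean` is hypothetical, and the standard library is used to look up the normal quantile instead of a table.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, half_width, confidence=0.95):
    """Smallest n so that a normal-based confidence interval for the mean
    has half-width at most `half_width`, given a known standard deviation."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    return math.ceil((z * sigma / half_width) ** 2)  # always round up

# Bird-weight example: standard deviation 12 g, half-width 3 g
print(sample_size_for_mean(12, 3, confidence=0.95))  # 62
print(sample_size_for_mean(12, 3, confidence=0.99))  # 107
```

The result is rounded up because rounding down would give an interval slightly wider than requested.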

Sample size for testing $\mu$

(based on the magnitude of the Type I and Type II error probabilities)

Type I: committed if we reject the null hypothesis when it is true. The probability of a Type I error is $\alpha$.

Type II: committed if we fail to reject the null hypothesis when it is false and the alternative hypothesis is true. The probability of a Type II error is $\beta$.

1) One sided test

$$n = \frac{\sigma^2 (z_{\alpha} + z_{\beta})^2}{\Delta^2}$$

(if $\sigma$ is unknown, substitute an estimated value)

$$\Delta = |\mu - \mu_0|$$

where $\mu_0$ is the mean under the null hypothesis and $\mu$ is the true mean that we want to be able to detect.

2) Two sided test

$$n = \frac{\sigma^2 (z_{\alpha/2} + z_{\beta})^2}{\Delta^2}$$
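The one sided and two sided formulas differ only in the normal quantile used for the Type I error. The sketch below assumes those two formulas; the function name and the input values (standard deviation 12, difference to detect 3, Type I error 0.05, Type II error 0.20) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_test(sigma, delta, alpha=0.05, beta=0.20, two_sided=False):
    """Sample size for a test about one mean with known standard deviation.

    alpha: probability of a Type I error
    beta:  probability of a Type II error when the true mean differs
           from the hypothesized mean by `delta`
    """
    normal = NormalDist()
    tail = alpha / 2.0 if two_sided else alpha
    z_alpha = normal.inv_cdf(1.0 - tail)
    z_beta = normal.inv_cdf(1.0 - beta)
    return math.ceil(sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Hypothetical inputs: sigma = 12, difference to detect = 3
print(sample_size_for_test(12, 3, two_sided=False))  # 99
print(sample_size_for_test(12, 3, two_sided=True))   # 126
```

For the same error probabilities, a two sided test needs more observations than a one sided test because $z_{\alpha/2}$ is larger than $z_{\alpha}$.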

Sample size for inferences about $\mu_1 - \mu_2$

$$n = \frac{2 z_{\alpha/2}^2 \sigma^2}{E^2}$$

(Both samples of the same size $n$)

Sample sizes for testing $H_0: \mu_1 - \mu_2 = D_0$

1) One sided test

$$n = \frac{2 \sigma^2 (z_{\alpha} + z_{\beta})^2}{\Delta^2}$$

2) Two sided test

$$n = \frac{2 \sigma^2 (z_{\alpha/2} + z_{\beta})^2}{\Delta^2}$$
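The two-sample formulas are the one-sample formulas with an extra factor of 2, and the result is the size of each group. A minimal sketch follows, assuming equal group sizes and a common known standard deviation; the function names and the input values (standard deviation 12, margin or difference of 3) are hypothetical.

```python
import math
from statistics import NormalDist

def two_sample_size_for_interval(sigma, half_width, confidence=0.95):
    """Per-group n for a confidence interval on the difference of two means."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(2.0 * z ** 2 * sigma ** 2 / half_width ** 2)

def two_sample_size_for_test(sigma, delta, alpha=0.05, beta=0.20, two_sided=False):
    """Per-group n for a test on the difference of two means."""
    normal = NormalDist()
    tail = alpha / 2.0 if two_sided else alpha
    z_alpha = normal.inv_cdf(1.0 - tail)
    z_beta = normal.inv_cdf(1.0 - beta)
    return math.ceil(2.0 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Hypothetical inputs: sigma = 12, margin of error or difference to detect = 3
print(two_sample_size_for_interval(12, 3))              # 123 per group
print(two_sample_size_for_test(12, 3, two_sided=False)) # 198 per group
print(two_sample_size_for_test(12, 3, two_sided=True))  # 252 per group
```

The factor of 2 arises because the difference between two independent sample means has twice the variance of a single sample mean.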

Sample size for estimating $\mu_d$ (based on paired data differences)

$$n = \frac{z_{\alpha/2}^2 \sigma_d^2}{E^2}$$

(if $\sigma_d$ is unknown, substitute an estimated value to obtain an approximate sample size)

Sample size required for a $100(1 - \alpha)\%$ C.I. for a binomial parameter $p$

$$n = \frac{z_{\alpha/2}^2 \, p (1 - p)}{E^2}$$

(since $p$ is not known, either substitute an educated guess or use $p = 0.5$, which will generate the largest possible sample size for the specified confidence width)
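The claim that $p = 0.5$ gives the largest sample size can be checked numerically. The sketch below assumes the normal-approximation formula for a binomial proportion; the function name and the margin of error of 0.05 are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(half_width, p=0.5, confidence=0.95):
    """Sample size for a normal-approximation confidence interval
    for a binomial proportion with the given half-width."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / half_width ** 2)

# Hypothetical margin of error of 0.05 at 95% confidence
print(sample_size_for_proportion(0.05))         # 385, the conservative choice p = 0.5
print(sample_size_for_proportion(0.05, p=0.2))  # 246, using an educated guess of p = 0.2
```

Because $p(1 - p)$ is largest at $p = 0.5$, using 0.5 is the safe choice when no reasonable guess for $p$ is available.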