1- Between-individual variation
·
In biology, more than in physics and
chemistry, variation is the rule and the causes of variation are many and
diverse
·
Whether a source of variation is
regarded as due to a factor of interest or to random variation depends on the
particular question being asked
·
If we are interested in how one
characteristic of our experimental subjects (variable A) is affected by two
other characteristics (variables B and C), then A is called the dependent variable
(response variable) and B and C are called independent variables (factors).
·
It is likely that A will be affected
by more than just the two factors, B and C, that we are interested in. Any variation
in the dependent variable between individuals in our sample that cannot be
attributed to the independent factors is called random variation
(between-individual variation or noise).
·
We are often interested in removing
the effects of random variation
·
The following are the major tools to
remove or reduce random variation
2- Replication
·
Replication involves making the same
manipulations and taking the same measurements on a number of different
experimental subjects (replicates).
·
Suppose we want to test the
hypothesis: Gender has an effect on human height
·
To test our hypothesis we might find
out the heights of two famous people from the records and conclude that, for
example, males are taller than females. Of course, this reasoning is not sound:
the difference in height between two individuals could have been due to random variation.
·
The solution is to sample more
individuals so that we have replicate males and females. Suppose that we
measure the height of 10 males and 10 females and find out that all males are
taller than all females. Now we can be more confident that there really is a
difference in height. If we found the same thing with 100 males and 100 females
we would be even more confident.
·
What we have done is replicate our
observation. If differences were due to chance (random variation) we would not
expect the same trend to occur in a larger sample.
·
All statistics are based on
replication, and are really just a way of formalizing the idea that the more
times we observe a phenomenon the less likely it is to be occurring simply by
chance.
·
The standard deviation is a measure of
the spread of values around the mean (average) value. If the distribution of
values has a Gaussian shape (is normally distributed), then about 95% of the values are
within two standard deviations of the mean.
·
How spread out a normal distribution
is depends on how much variation due to random factors there is, and this is
measured by the standard deviation of the distribution.
·
Replication is a way of dealing with
the between-individual variation (random variation) that will be present in any
biology experiment. The more replicates we have, the greater the confidence we
have that any difference we see between our experimental groups is due to the
factors that we are interested in and not due to chance.
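The effect of replication described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean (170 cm), the standard deviation (7 cm), and the function name are hypothetical and not taken from any real data set. Both groups are drawn from the same population, so any difference between the group means is pure chance; the simulation shows how the typical size of such chance differences shrinks as the number of replicates grows.

```python
import random
import statistics


def spread_of_group_differences(n_per_group, trials=2000, mean=170.0, sd=7.0, seed=1):
    """SD of the difference between two group means when there is NO real effect.

    Both groups come from the same hypothetical normal population, so every
    observed difference is due to random (between-individual) variation alone.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(trials):
        group_a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        differences.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return statistics.stdev(differences)


# Typical chance difference with 1, 10 and 100 replicates per group
for n in (1, 10, 100):
    print(n, round(spread_of_group_differences(n), 2))
```

With one individual per group, chance differences of several centimetres are common; with 100 per group they become small, which is why a difference seen in a large sample is harder to attribute to chance.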
3- Pseudoreplication
·
The effect of replication in removing
the random variation relies on one critical rule: “the replicate measures must
be independent of each other.” This means that any individual is just as likely
to have a positive deviation from the norm due to random variation as it is to
have a negative one. Across many independent individuals these deviations tend to
cancel out, so the mean of the sample will be close to the mean of the population.
·
For example, suppose that instead of measuring
the height of one man and one woman once, we measure each of them 10 times. Obviously
this will not let us draw any firmer conclusion. The 10 male measures are not
independent of each other; they were all made on the same man, so the measurements are
correlated.
·
Failure to have independent
replicates is a very serious problem for an experimental design, as almost all
statistical tests demand independence.
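A short simulation can make the cost of non-independence concrete. The sketch below is an illustration only: the population mean, the between-individual standard deviation, and the measurement error are all hypothetical values chosen for the example. It compares how far the sample mean typically falls from the population mean when 10 values come from one individual measured 10 times versus 10 different individuals measured once each.

```python
import random
import statistics

POP_MEAN = 170.0       # hypothetical population mean height (cm)
BETWEEN_SD = 7.0       # hypothetical between-individual variation (cm)
MEASUREMENT_SD = 0.5   # hypothetical measurement error (cm)


def sample_mean(n_individuals, measures_each, rng):
    """Mean of all measurements taken on randomly drawn individuals."""
    values = []
    for _ in range(n_individuals):
        true_height = rng.gauss(POP_MEAN, BETWEEN_SD)
        for _ in range(measures_each):
            values.append(rng.gauss(true_height, MEASUREMENT_SD))
    return statistics.fmean(values)


def spread_of_sample_mean(n_individuals, measures_each, trials=2000, seed=1):
    """SD of the sample mean across many repeated simulated studies."""
    rng = random.Random(seed)
    means = [sample_mean(n_individuals, measures_each, rng) for _ in range(trials)]
    return statistics.stdev(means)


pseudo = spread_of_sample_mean(n_individuals=1, measures_each=10)
independent = spread_of_sample_mean(n_individuals=10, measures_each=1)
print("1 individual x 10 measures :", round(pseudo, 2))
print("10 individuals x 1 measure :", round(independent, 2))
```

Both designs yield 10 numbers, but the repeated measures share the same individual deviation, so that deviation never cancels out and the sample mean stays far less reliable.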
1) Common sources of pseudoreplication
·
The use of multiple measurements of
the same individual as if they were independent measurements is an extreme and
obvious form of pseudoreplication
·
Shared enclosure. In experiments
where two groups are placed in two enclosures each with a number of subjects,
these are not independent samples because the effect may be due to differences
in the enclosures. All factors must be the same in both enclosures for the
samples to be independent.
·
Common environment. Similarly, in
experiments where two groups are collected from two environments, all the
factors in both environments must be the same for the samples to be independent
·
Relatedness. Similarity due to
genetics means that relatives are not independent data points when looking at
the effects of other treatments.
·
Pseudoreplicated stimulus. In
experiments where two stimuli are used, if there are differences between the
two stimuli other than the intended one then the data points are not
independent.
·
Individuals are part of the
environment too. If one individual can affect the others in shared enclosure or
environment, then the data points are not independent
·
Pseudoreplication of measurements
through time. This can occur in any experiment where we take multiple
measurements through time. Whether the measurements are independent will depend
critically on the biology of the system.
·
Species comparisons and
pseudoreplication. In experiments where we select two or more species and carry
out a different treatment on each species, the data points are not independent
because the effect may be due to differences between the species and not due to
the different treatments.
Whether measurements are pseudoreplicates will depend on the biology of the species that you are studying and the questions that you are asking. Thus, pseudoreplication is a problem that has to be addressed by biologists and not by statisticians.
4- Randomization
·
Randomization means that any
individual experimental subject has the same chance of finding itself in each
experimental group. It means drawing random samples for study from the wider
population of all possible individuals that could be in your sample.
·
Randomization is important to avoid
pseudoreplication problems and inadequate replication.
1) Haphazard sampling
Haphazard sampling is not the same as random sampling.
Example:
You have a tank full of 40 snails
that you want to use in an experiment. The experiment requires that you
allocate them to one of four different treatment groups. Random allocation
means that you do the following:
a) Each snail is given a number from 1 to 40
b) Pieces of paper with the numbers 1 to 40 are then placed in a hat
c) Ten numbers are drawn blindly, and the snails with these numbers are allocated to
treatment A
d) Ten more are drawn and allocated to treatment B, and so on until all snails have
been allocated.
Haphazard sampling would be as
follows:
a) Place your hand in the tank and take out whichever snail comes to hand
b) Allocate the first ten snails to treatment A, and so on until all snails have been
allocated
There are many reasons why
the first snail to be picked out could be systematically
different from the last snail. Perhaps smaller snails have a better chance of
avoiding your grasp.
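The hat-drawing procedure for random allocation can be sketched in code. This is a minimal illustration, assuming snails are identified by the numbers 1 to 40 and the four treatments are labelled A to D; the function name and the `seed` parameter are hypothetical conveniences, not part of the original procedure.

```python
import random


def allocate_randomly(n_subjects=40, treatments=("A", "B", "C", "D"), seed=None):
    """Randomly allocate numbered subjects to equal-sized treatment groups."""
    if n_subjects % len(treatments) != 0:
        raise ValueError("subjects must divide evenly among the treatments")
    rng = random.Random(seed)
    ids = list(range(1, n_subjects + 1))  # step a: give each snail a number
    rng.shuffle(ids)                      # step b: mix the numbers in the hat
    group_size = n_subjects // len(treatments)
    # steps c and d: the first ten numbers drawn go to A, the next ten to B, ...
    return {
        treatment: sorted(ids[i * group_size:(i + 1) * group_size])
        for i, treatment in enumerate(treatments)
    }


groups = allocate_randomly(seed=42)
for treatment, snails in groups.items():
    print(treatment, snails)
```

Because the order comes from a shuffle and not from which snail is easiest to catch, no property of the snails (size, speed, position in the tank) can influence which group a snail ends up in. Passing a seed only makes the allocation reproducible for record keeping.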
2) Self-selection
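Phone polls are very poor indicators
of what the wider public believes, because the people who respond have chosen
to take part and may differ systematically from those who do not. We must be
very careful when interpreting any data collected this way.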
·
Care must be taken to ensure that the
random sample that we take is representative of the population that we wish to
sample.
·
The need to randomize doesn’t just
apply to the setting up of an experiment, but can equally apply to measuring.
It is far better to organize your sampling procedure so that individuals are
measured in a random order.
·
Your aim is to get a representative
sample and not a random sample. Always ask if the sample is unrepresentative.
Decide whether or not to reject a random sample immediately, generally before
you have looked at the data from that sample.
5- Selecting an appropriate number of replicates
·
A natural question that arises is how
many replicates do we need?
·
The more replicates we have, the more
confident we can be that differences between groups are real and not simply due
to chance effects. However, increasing replication incurs costs: financial costs,
time costs, and welfare or conservation costs.
·
The sample should be big enough to give you
confidence that you will be able to detect biologically meaningful effects that
exist, but not so big that some sampling was unnecessary.
·
The way to determine the appropriate
number of replications is to use the correct statistical methods
Sample Size
Sample size for estimating $\mu$
First determine how accurate the estimate needs to be: Mean ± E (width of the
interval = 2E).
For a $100(1 - \alpha)\%$ confidence interval for $\mu$ of the form Mean ± E:
$n = (z_{\alpha/2})^2 \sigma^2 / E^2$
(Requires knowledge of the population variance, which can be obtained in one of two ways)
1- Use prior experiments to calculate the variance
2- Use the range ($\sigma \approx$ range / 4)
Example:
A biologist
would like to estimate the effect of a drug on the growth of a particular bird
species by examining the mean weight of the bird when a fixed amount of the
drug is applied ($\sigma = 12$ grams). How many observations are needed to estimate
the mean weight of this bird species using a 95% confidence interval with a
half-width of 3 grams?
$n = (1.96)^2 (12)^2 / (3)^2 \approx 61.5$, so 62 observations are needed.
Using a 99% confidence interval ($z_{\alpha/2} = 2.58$):
$n = (2.58)^2 (12)^2 / (3)^2 \approx 106.5$, so 107 observations are needed.
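The bird-weight calculation can be reproduced with a few lines of code. This is a sketch only: the function name is hypothetical, and the z value is taken from Python's standard `statistics.NormalDist` instead of a rounded table value, which does not change the rounded-up answers here.

```python
import math
from statistics import NormalDist


def n_for_mean(sigma, half_width, confidence=0.95):
    """Observations needed so that the CI for the mean is Mean +/- half_width.

    Assumes sigma (the population standard deviation) is known or estimated.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z value for alpha/2
    # Always round up: a fractional observation cannot be collected
    return math.ceil((z * sigma / half_width) ** 2)


print(n_for_mean(sigma=12, half_width=3, confidence=0.95))  # 62
print(n_for_mean(sigma=12, half_width=3, confidence=0.99))  # 107
```

Note how the required sample size depends on the square of sigma / E, so halving the half-width quadruples the number of observations needed.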
Sample size for testing $\mu$
(based on the magnitude of the Type I and Type II error probabilities)
Type I: committed if we reject the null hypothesis when it is
true. The probability of a Type I error = $\alpha$
Type II: committed if we accept the null hypothesis when it is
false and the alternative hypothesis is true. The probability of a Type II error
= $\beta$
1) One sided test
$n = \sigma^2 (z_{\alpha} + z_{\beta})^2 / \Delta^2$
(if $\sigma$ is unknown, substitute an estimated value)
$\Delta = |\mu - \mu_0|$
2) Two sided test
$n = \sigma^2 (z_{\alpha/2} + z_{\beta})^2 / \Delta^2$
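The one-sided and two-sided formulas above can be sketched as a single function. The numbers in the usage lines are hypothetical and chosen purely for illustration (a standard deviation of 12, a difference of 5 to detect, a Type I error probability of 0.05 and a Type II error probability of 0.10); the function name is also an assumption.

```python
import math
from statistics import NormalDist


def n_for_test(sigma, delta, alpha=0.05, beta=0.10, two_sided=False):
    """Sample size for testing a single mean.

    sigma: population standard deviation (known or estimated)
    delta: |mu - mu0|, the size of the difference to be detected
    alpha: probability of a Type I error
    beta:  probability of a Type II error
    """
    z = NormalDist().inv_cdf
    # A two-sided test splits alpha between the two tails
    z_alpha = z(1.0 - (alpha / 2.0 if two_sided else alpha))
    z_beta = z(1.0 - beta)
    return math.ceil(sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Hypothetical inputs, for illustration only
print(n_for_test(sigma=12, delta=5))                  # 50
print(n_for_test(sigma=12, delta=5, two_sided=True))  # 61
```

The two-sided test needs more observations because each tail gets only half of the Type I error probability, which makes the critical z value larger.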
Sample size for inferences about $\mu_1 - \mu_2$
$n = 2 z_{\alpha/2}^2 \sigma^2 / E^2$
(Both samples of the same size)
Sample sizes for testing $H_0: \mu_1 - \mu_2 = D_0$
1) One sided test
$n = 2\sigma^2 (z_{\alpha} + z_{\beta})^2 / \Delta^2$
2) Two sided test
$n = 2\sigma^2 (z_{\alpha/2} + z_{\beta})^2 / \Delta^2$
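The two-sample formulas differ from the one-sample ones only by the factor of 2, and the result is the size of each group. The sketch below assumes equal group sizes and a common standard deviation, as the formulas do; the function names and the input values (standard deviation 12, half-width 3, difference 5) are hypothetical illustrations.

```python
import math
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def n_per_group_for_difference(sigma, half_width, confidence=0.95):
    """Per-group size for a CI on mu1 - mu2 of the form estimate +/- half_width."""
    alpha = 1.0 - confidence
    z = _z(1.0 - alpha / 2.0)
    return math.ceil(2.0 * z ** 2 * sigma ** 2 / half_width ** 2)


def n_per_group_for_test(sigma, delta, alpha=0.05, beta=0.10, two_sided=False):
    """Per-group size for testing a hypothesis about mu1 - mu2."""
    z_alpha = _z(1.0 - (alpha / 2.0 if two_sided else alpha))
    z_beta = _z(1.0 - beta)
    return math.ceil(2.0 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Hypothetical inputs, for illustration only
print(n_per_group_for_difference(sigma=12, half_width=3))       # 123
print(n_per_group_for_test(sigma=12, delta=5))                  # 99
print(n_per_group_for_test(sigma=12, delta=5, two_sided=True))  # 122
```

The total number of experimental subjects is twice the value returned, since each of the two groups needs that many replicates.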
Sample size for estimating $\mu_d$ (based on paired data differences)
$n = z_{\alpha/2}^2 \sigma_d^2 / E^2$
(if $\sigma_d$ is unknown, substitute an
estimated value to obtain an approximate sample size)
Sample size required for a $100(1 - \alpha)\%$ C.I. for a binomial parameter
$n = z_{\alpha/2}^2 \, p(1 - p) / E^2$
(since $p$ is not known, either substitute an educated
guess or use $p = 0.5$, which will generate
the largest possible sample size for the specified confidence width)
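The binomial formula, and the claim that p = 0.5 gives the largest sample size, can be checked with a short sketch. The function name and the example half-width of 0.05 are hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def n_for_proportion(half_width, p=0.5, confidence=0.95):
    """Sample size for a CI on a binomial proportion of the form p_hat +/- half_width.

    p is a prior guess at the true proportion; p = 0.5 is the conservative default.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / half_width ** 2)


# Hypothetical inputs, for illustration only
print(n_for_proportion(0.05))         # 385 (conservative, p = 0.5)
print(n_for_proportion(0.05, p=0.2))  # 246 (educated guess p = 0.2)
```

The product p(1 - p) peaks at p = 0.5, so any other guess yields a smaller n. Using 0.5 therefore guarantees the requested interval width whatever the true proportion turns out to be.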