We were talking about blocking in our experimental design class today, and a student asked, "When would you want unequal numbers of units in the treatment and control groups?"
I replied that the simplest example is cost: there may be 10,000 people in the population, but 99% of them will end up in the control group because the budget is only big enough to apply the treatment to 100 people. In other settings the treatment is destructive and can only be applied to a small fraction of the available units.
But even if cost is not a concern and you just want to maximize statistical efficiency, it can still make sense to assign different numbers of units to the two groups.
For example, I started by supposing that your outcomes are much more variable under the treatment than under the control. Then the basic estimate of the treatment effect (the average outcome in the treated group minus the average outcome in the control group) would seem to be more precise if you take more treated observations, to make up for their higher variance.
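In symbols, with n_c control units, n_t treated units, and within-group standard deviations s_c and s_t (the same names used in the R code below), the standard error of that difference in means is

se = sqrt(s_t^2/n_t + s_c^2/n_c),

and the design question is how to split a fixed total n = n_c + n_t so that this is as small as possible.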
But then I stopped for a moment. I was confused.
I had two intuitions pointing in opposite directions:
(1) The treated observations are more variable than the controls, so you need more treated measurements to get a precise estimate for the treated group.
(2) The treated observations are more variable than the controls, so each treated measurement is less informative, and you'd do better to spend more of the budget on the precise control measurements.
My feeling was that the correct reasoning was (2), not (1), but I wasn't sure.
How did I resolve this?
Brute force.
Here's the R:
n <- 100
expt_sim <- function(n, p=0.5, s_c=1, s_t=2){
  # n: total number of units; p: proportion assigned to treatment
  # s_c, s_t: sd of the measurements in the control and treated groups
  n_c <- round((1-p)*n)
  n_t <- round(p*n)
  se_dif <- sqrt(s_c^2/n_c + s_t^2/n_t)   # se of the difference in means
  se_dif
}
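# Quick sanity check of the function at a few allocations (values are approximate):
expt_sim(100, 0.1)   # 10% treated: se around 0.64
expt_sim(100, 0.5)   # balanced design: se around 0.32
expt_sim(100, 0.9)   # 90% treated: se around 0.38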
curve(expt_sim(100, x), from=.01, to=.99,
  xlab="Proportion of data in the treatment group",
  ylab="se of estimated treatment effect",
  main="Assuming sd of measurements is\ntwice as high for treated as for controls",
  bty="l")
The results are as follows:
Ooh, shoot, I don't like that the y-axis doesn't go all the way down to zero; it makes the decline in the standard error look more dramatic than it really is. Zero is in the neighborhood, so let's include it:
curve(expt_sim(100, x), from=.01, to=.99,
  xlab="Proportion of data in the treatment group",
  ylab="se of estimated treatment effect",
  main="Assuming sd of measurements is\ntwice as high for treated as for controls",
  bty="l",
  xlim=c(0, 1), ylim=c(0, 2), xaxs="i", yaxs="i")
And now we can see the answer. If the measurements in the treatment group have twice the standard deviation of those in the control group, you want twice as many treated units as controls: the curve is minimized at x = 2/3. (You could check this without plotting anything, but the graph provides some intuition and a visual sanity check.) So reasoning (1) above was the correct one.
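If you don't trust reading the minimum off the graph, a quick grid search over the same function gives the same answer; something like this:

p_grid <- seq(0.01, 0.99, 0.01)
se_grid <- sapply(p_grid, function(p) expt_sim(100, p))
p_grid[which.min(se_grid)]   # 0.67, i.e. roughly 2/3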
On the other hand, the standard error under the optimal design is not that much lower than under the simple 50/50 design, as we can see by computing the ratio:
print(expt_sim(100, 2/3) / expt_sim(100, 1/2))
which comes to about 0.95.
So the improved design reduces the standard error by about 5%; in other words, roughly a 10% gain in efficiency. It's not nothing, but it's not a lot.
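(The 10% is in terms of variance, or equivalently effective sample size; to see it directly, square the ratio:)

expt_sim(100, 2/3)^2 / expt_sim(100, 1/2)^2   # about 0.90, a 10% reduction in variance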
Anyway, the main point of this post is that you can learn a lot from simulation. Of course, in this case the problem can also be solved analytically: the optimal design sets the ratio of treated to control units equal to the ratio of the standard deviations, n_t/n_c = s_t/s_c. That's fine, but I like the brute-force solution.
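For the skeptical, here's a quick sketch checking that this analytic allocation (equivalently, a proportion p = s_t/(s_t + s_c) of units in the treatment group) matches the brute-force minimum for a few values of s_t, reusing expt_sim() from above with a large n so the rounding doesn't matter:

p_grid <- seq(0.01, 0.99, 0.001)
for (s_t in c(1, 2, 4)) {
  se_grid <- sapply(p_grid, function(p) expt_sim(1e4, p, s_c=1, s_t=s_t))
  cat("s_t =", s_t,
      " brute force:", p_grid[which.min(se_grid)],
      " analytic:", round(s_t/(s_t + 1), 3), "\n")
}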