Here we will trick a statistical analysis with what is known as p-value optimization, often called p-hacking.

We will test whether clapping your boots three times before leaving the house in the morning increases your average salary. Try it; it doesn't.

In [19]:
# set up some random data.



income = rnorm(1000,mean=5000,sd=200)  # your sample. Everybody must clap boots every day for 3 years, then you record salary. You know that average salary of this population is 5000.



In [29]:
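# Now let's do a t-test. The hypothesis is that the average salary is larger than the known average of 5000.

# Slide a window of 10 consecutive data points over the sample and stop at the first "significant" window.
for (i in seq(1,989,by=1))
    {
    partialdata=income[i:(i+9)]                            # only 10 data points, starting at position i
    T=(mean(partialdata)-5000)/sd(partialdata)*sqrt(10)    # one-sample t statistic for n=10
    if (T>qt(p=0.95,df=9))                                 # one-sided critical value at the 5% level
        {break}
    }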

i
T

There you go, clap your boots to become rich! Where did we cheat? Incomplete data! We collected i+9 data points but analyzed only the last 10, dropping the i-1 points before them. Note that we did not have to "pick" specific values, such as low incomes, that did not suit us. We blindly, without bias, took 10 consecutive data points and let statistical fluctuations do the dirty work.

A typical setting to look out for: "Student A does not show the effect." "Student A was probably not skilled enough." "Let student B have a go." "Student B is not skilled either." "Let's see if student C can do it."...
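To get a feeling for how reliably the sliding-window search "finds" an effect in pure noise, the whole study can be repeated many times. The following is only a sketch: the function name `window_search_succeeds`, the seed, and the number of repetitions (200) are choices made for this illustration and are not part of the analysis above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Returns TRUE if at least one window of n consecutive points is
# "significant" in a one-sided t-test against mu0.
window_search_succeeds <- function(income, mu0 = 5000, n = 10, alpha = 0.05) {
  crit <- qt(1 - alpha, df = n - 1)               # one-sided critical value
  for (i in seq_len(length(income) - n + 1)) {
    x <- income[i:(i + n - 1)]                    # window of n data points
    tstat <- (mean(x) - mu0) / sd(x) * sqrt(n)    # one-sample t statistic
    if (tstat > crit) return(TRUE)
  }
  FALSE
}

# Repeat the complete "study" 200 times; the true mean is always 5000.
hits <- replicate(200, window_search_succeeds(rnorm(1000, mean = 5000, sd = 200)))
success_rate <- mean(hits)
success_rate
```

Because the search gets hundreds of attempts per study, the fraction of studies in which it succeeds should be far above the nominal 5%.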

In [30]:
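# Now let's do a proper t-test. The hypothesis is that the average salary is larger than the known average of 5000.

# This time we keep every data point collected so far, so the sample size grows with i.
for (i in seq(1,989,by=1))
    {
    partialdata=income[1:(i+9)]
    n=length(partialdata)                                  # n = i+9, not 10
    T=(mean(partialdata)-5000)/sd(partialdata)*sqrt(n)     # t statistic must use the actual sample size
    if (T>qt(p=0.95,df=n-1))                               # degrees of freedom are n-1
        {break}
    }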

i
T
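For comparison, the simplest honest analysis is a single test on all 1000 data points, decided on before looking at the data. A minimal sketch using R's built-in `t.test`, with the manual formula as a cross-check; the data are regenerated here (with an arbitrary seed) so the block runs on its own, and the names `full_test` and `manual_t` are only illustrative.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Same kind of sample as above: 1000 incomes, true mean 5000.
income <- rnorm(1000, mean = 5000, sd = 200)

# One pre-planned, one-sided t-test on the complete sample.
full_test <- t.test(income, mu = 5000, alternative = "greater")
full_test$statistic   # t statistic
full_test$parameter   # degrees of freedom, n - 1
full_test$p.value

# Cross-check against the manual formula with the full sample size.
n <- length(income)
manual_t <- (mean(income) - 5000) / sd(income) * sqrt(n)
manual_t
```

Both ways of computing the statistic should agree; the point is that the sample size entering the formula is the number of data points actually collected.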

The dirty variation: The Chocolate Diet

It is obvious that omitting datapoints from a study, irrespective of whether you pick specific points or choose some at random, is dubious. How about dropping factors?

We take 1000 people from the same population, sort them into 100 groups of 10 people each, and subject them to totally nonsensical diets: "Liquid butter diet", "Travel to the moon diet", "Chicken nuggets diet", etc. None of these actually works.

In [65]:
# set up some random data. The value denotes the weight loss/gain after 6 weeks of diet.

n=10 #group size
m=100 # number of groups

weightdata = matrix( rnorm(n*m,mean=0,sd=3), n, m) # weightloss of participants. mean=0, so all 100 diets are useless.

In [66]:
summary(weightdata)

       V1                V2                V3                V4         
 Min.   :-2.7387   Min.   :-6.2774   Min.   :-4.5216   Min.   :-2.4288  
 1st Qu.:-1.7837   1st Qu.:-0.2651   1st Qu.:-0.6638   1st Qu.:-0.1889  
 Median : 0.2215   Median : 2.0853   Median : 1.6549   Median : 0.9696  
 Mean   : 0.2599   Mean   : 1.1211   Mean   : 1.1930   Mean   : 1.0904  
 3rd Qu.: 1.2789   3rd Qu.: 3.4290   3rd Qu.: 2.6253   3rd Qu.: 2.1775  
 Max.   : 6.0157   Max.   : 4.5204   Max.   : 6.3859   Max.   : 4.4411  
       V5                V6                V7               V8         
 Min.   :-3.6839   Min.   :-3.6010   Min.   :-5.257   Min.   :-3.9968  
 1st Qu.:-0.8926   1st Qu.:-2.9950   1st Qu.:-3.309   1st Qu.:-2.4232  
 Median : 0.2585   Median : 0.7410   Median :-2.115   Median :-0.2503  
 Mean   :-0.2065   Mean   :-0.2975   Mean   :-1.351   Mean   : 0.1509  
 3rd Qu.: 1.1194   3rd Qu.: 1.5913   3rd Qu.: 1.387   3rd Qu.: 2.6697  
 Max.   : 1.8276   Max.   : 3.1081   Max.   : 2.876   Max

In [81]:
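successfuldiets=vector()   # indices of the diets that pass the test

for (i in seq(1,100))
    {
    # one-sided t-test per diet: is the mean weight change below 0, i.e. a weight loss?
    mypvalue=t.test(weightdata[,i],mu=0,alternative="less")$p.value

    # keep every diet that is "significant" at the 5% level
    if (mypvalue<0.05)
        {successfuldiets=c(successfuldiets,i)}
    }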

In [82]:
successfuldiets
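Why do some useless diets pass? With 100 independent tests at the 5% level, about 100 * 0.05 = 5 false positives are expected, and the chance of at least one is 1 - 0.95^100, roughly 0.994. The sketch below does this arithmetic and then applies a Bonferroni correction via `p.adjust` as one possible remedy; the correction is not part of the analysis above, and the seed and the names `pvalues`, `naive`, and `adjusted` are illustrative. The data are regenerated so the block runs on its own.

```r
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

n <- 10       # group size
m <- 100      # number of diets
alpha <- 0.05

# All diets are useless: the true mean weight change is 0.
weightdata <- matrix(rnorm(n * m, mean = 0, sd = 3), n, m)

# Chance arithmetic for m independent tests of true null hypotheses.
expected_hits  <- m * alpha              # expected number of false positives
p_at_least_one <- 1 - (1 - alpha)^m      # probability of at least one

# One p-value per diet, same one-sided test as in the loop above.
pvalues <- apply(weightdata, 2, function(x)
  t.test(x, mu = 0, alternative = "less")$p.value)

naive    <- which(pvalues < alpha)                                    # uncorrected
adjusted <- which(p.adjust(pvalues, method = "bonferroni") < alpha)   # corrected

expected_hits
p_at_least_one
naive
adjusted
```

The corrected list can never be longer than the uncorrected one, since Bonferroni multiplies every p-value by the number of tests before comparing it with 0.05.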