This is why I'm terrible at statistics. To me, it looks almost like wave function collapse.
The results are good only if you do a set number of observations (say 500) instead of waiting for a significant result (say it happens at 623). But what if you had decided to run 623 tests at the beginning?
No problem with that. But compare these two experiments:
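    # Experiment 1: fixed sample size, one test at the end
    for i in range(623):
        data.add_result()
    s = calculate_significance(data)
    if s > 0.95:
        publish()

    # Experiment 2: test after every observation, stop at the first success
    for i in range(623):
        data.add_result()
        s = calculate_significance(data)
        if s > 0.95:
            publish()
            break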
The second one gives you many more chances to succeed (up to 623 looks at the data instead of one), so a "significant" result from it has to count for less.
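If you want to see the effect rather than take it on faith, here is a rough simulation sketch. Everything in it is my own assumption, not part of the pseudocode above: the "data" is a fair coin (so there is no real effect to find), calculate_significance is swapped for a plain two-sided z-test at roughly the 5% level, and min_n=30 is an arbitrary point at which the peeking starts. Any "publish" here is a false positive by construction.

```python
import math
import random

def is_significant(heads, n, z_crit=1.96):
    """Two-sided z-test of a fair coin (p = 0.5) at roughly the 5% level."""
    z = (heads - 0.5 * n) / math.sqrt(0.25 * n)
    return abs(z) > z_crit

def run_trial(rng, n_max=623, min_n=30):
    """Flip a fair coin n_max times; return (fixed_hit, peeking_hit)."""
    heads = 0
    peeking_hit = False
    for n in range(1, n_max + 1):
        heads += rng.random() < 0.5
        # Experiment 2: check after every observation once n >= min_n (assumed).
        if n >= min_n and not peeking_hit and is_significant(heads, n):
            peeking_hit = True
    # Experiment 1: a single check at the pre-chosen sample size.
    fixed_hit = is_significant(heads, n_max)
    return fixed_hit, peeking_hit

def false_positive_rates(trials=2000, seed=0):
    """Fraction of no-effect trials that each procedure would 'publish'."""
    rng = random.Random(seed)
    fixed = peeking = 0
    for _ in range(trials):
        f, p = run_trial(rng)
        fixed += f
        peeking += p
    return fixed / trials, peeking / trials

fixed_rate, peeking_rate = false_positive_rates()
print(f"fixed n=623: {fixed_rate:.3f}, peeking: {peeking_rate:.3f}")
```

The fixed-size rate should land near the nominal 5%, while the peeking rate should come out several times higher, even though both procedures see exactly the same flips. That is the sense in which deciding on 623 up front differs from happening to stop at 623.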