What are you doing to prevent the test set from being leaked? Will you still be offering API access to the semi-private test set to the big model providers, who presumably train on data sent through their APIs?


We have a few sets:

1. Public Train - 1,000 tasks that are public

2. Public Eval - 120 tasks that are public

So for those two we don't have protections; they are fully public, and scoring against them is a simple exact match on the test outputs (see the sketch after this list).

3. Semi-Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we accept in order to keep testing velocity. It is very difficult to secure this 100%, even in theory; the cost to create a new semi-private test set is lower than the effort needed to fully secure the current one.

4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
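
To make the scoring side concrete: the public sets use the published ARC JSON task format ("train" demonstration pairs plus "test" pairs of input/output grids), and a task counts as solved only on an exact match of the test outputs. A minimal sketch, assuming that format; predict is a hypothetical stand-in for whatever model is under test, not our actual harness:

    import json, glob

    def score_public_eval(task_dir, predict):
        # Each ARC task file holds "train" demonstration pairs and
        # "test" pairs whose output grids must be reproduced exactly.
        correct, total = 0, 0
        for path in glob.glob(f"{task_dir}/*.json"):
            with open(path) as f:
                task = json.load(f)
            for pair in task["test"]:
                total += 1
                # The model sees the demonstrations plus one test input.
                if predict(task["train"], pair["input"]) == pair["output"]:
                    correct += 1
        return correct / total

The private eval is scored the same way in principle; the difference is purely where it runs and who can see the task files.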


What prevents everything in 4 from becoming part of 3 the first time the test set is run on a proprietary model? Do you require that competitors like OpenAI provide models Kaggle can self-host for the test?


#4 (private test set) doesn't get used for any public model testing. It is only used on the Kaggle leaderboard where no internet access is allowed.
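
If you want to sanity-check the no-internet claim, a notebook in that environment should fail any outbound connection. A rough illustration, not part of our actual harness (the host and port are arbitrary):

    import socket

    def assert_offline(host="example.com", port=80, timeout=2):
        # In an internet-disabled environment this connect attempt
        # should raise, so the private tasks can't leave the box.
        try:
            socket.create_connection((host, port), timeout=timeout)
        except OSError:
            return True  # no route out: offline as expected
        raise RuntimeError("network reachable; refusing to load private eval")

Kaggle enforces the isolation at the platform level; a check like this is just a belt-and-suspenders guard a submission could run before touching the private tasks.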


Sorry, I probably phrased the question poorly. My question is more along the lines of: "when you scored, e.g., OpenAI's o3 on ARC-AGI 2, how did you guarantee OpenAI couldn't just look at its server logs to see question set 4?"


Ah yes, two things:

1. We had a no-data-retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing

2. We only tested o3 against the semi-private set. We didn't test it with the private eval.


Are you aware that OpenAI brazenly lied and went back on its word about its corporate structure, board governance, and for-profit status, and of the opinion that your data sharing agreement is different and less likely to be ignored? Or are you at step zero where you aren’t considering malfeasance as a possibility at all?


Makes sense, particularly part 2, at least until "the final results" are needed. Thanks for taking the time to answer my question!


>> We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing

Uri Geller assured us he was bending the spoons with his mind. Somehow it was only when the Amazing Randi was present that Uri Geller couldn't bend the spoons with his mind.


Ironically "I have a magic AI test but nobody is allowed to use it" is a lot closer to the Yuri Geller situation. Tests are meant to be taken, that should be clear. And...maybe this does not apply in the academic domain, but to some extent if you cheat on an AI test "you're only cheating yourself."


> but to some extent if you cheat on an AI test "you're only cheating yourself."

You cheat investors.


And end users and developers and the general public too...

But here's the thing: even if it's rote memorization, why couldn't GPT-4o perform just as well on ARC-AGI 1? Or did the "reasoning" help in some way?



