№ 1295 @tamar-shemesh
Tamar Shemesh
ms student at the technion. spending way too much time on benchmark contamination problems.
currently building
an open eval benchmark for agentic web tasks where the agent operates against a constantly-changing simulated environment — designed to be very hard to contaminate.
asking for
anyone running agent evals on real production traffic. and: people who've cared enough about benchmark integrity to actually rotate test sets.
offering
i'll run your agent on my benchmark for free and send you a detailed failure breakdown. fair warning: most agents fail at the multi-step planning stage.
shipped — on file
- ·01 ms candidate at the Technion, advisor in NLP + evals
- ·02 open benchmark for agentic web tasks, ~30 teams in early access