dap.cards — pre-arrival dossier

№ 1295 @tamar-shemesh

Tamar Shemesh

ms student at the technion. spending way too much time on benchmark contamination problems.

currently building

an open eval benchmark for agentic web tasks where the agent operates against a constantly-changing simulated environment — designed to be very hard to contaminate.

asking for

anyone running agent evals on real production traffic. and: people who've cared enough about benchmark integrity to actually rotate test sets.

offering

i'll run your agent on my benchmark for free and send you a detailed failure breakdown. fair warning: most agents fail at the multi-step planning stage.

shipped — on file

·01 ms candidate at the Technion, advisor in NLP + evals
·02 open benchmark for agentic web tasks, ~30 teams in early access

draft a cold intro to anyone on file.
