we spent a few months building evals for a health agent (and the agent itself!). tried to apply anthropic's framework to a real system looking at CGM data + diet.
some of it worked. we got decent at checking form — citations exist, tools were called, numbers trace back. the harder part was essence — is this clinically appropriate? actually helpful? we didn't really solve that.
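to make the form/essence split concrete, here's roughly the shape of our "form" checks - just a sketch, with the trace fields (answer / sources / tool_calls) and the required tool names invented for the example, not our actual harness:

```python
import re

# illustrative rule-based "form" checks on a single agent trace.
# the trace layout and the required-tool names are assumptions.
def check_form(trace: dict) -> dict[str, bool]:
    answer = trace["answer"]
    results = {}

    # citations exist, and every [n] marker points at a real source
    cited = set(re.findall(r"\[(\d+)\]", answer))
    valid = {str(i + 1) for i in range(len(trace["sources"]))}
    results["citations_exist"] = bool(cited) and cited <= valid

    # the tools we expect were actually called
    called = {c["tool"] for c in trace["tool_calls"]}
    results["tools_called"] = {"cgm_reader", "diet_log"} <= called

    # every number in the answer traces back to some tool output
    # (strip citation markers first so [1] doesn't count as a number)
    stripped = re.sub(r"\[\d+\]", "", answer)
    answer_nums = set(re.findall(r"\d+(?:\.\d+)?", stripped))
    tool_nums = set()
    for c in trace["tool_calls"]:
        tool_nums |= set(re.findall(r"\d+(?:\.\d+)?", str(c["output"])))
    results["numbers_trace_back"] = answer_nums <= tool_nums

    return results
```

checks like these are cheap and objective, which is exactly why they only cover form.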
curious if others building health/bio agents have found ways around this, or if everyone's just accepting fuzzy metrics for the stuff that matters.
foundation models in biology still haven't proven they're worth it vs simpler methods (imo). we just published one in Nature, and i feel like i spent more time on "how will we know this worked" than on the model itself. the hard part was (mostly) deciding what success even means. open to thoughts
i honestly don't think there's a simple y/n answer there - the main considerations are things like 'how costly is it to do', 'how often do you think you'll need it', and so on.
traces aren't as "ephemeral" as FT models - you can reuse them to guide agent behaviour when a newer model is released (though still not as evergreen as other assets - traces generated with, say, GPT-4 would look pale and outdated next to ones created on the same dataset with Opus 4.5, i reckon)
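rough sketch of what i mean by reusing traces across model generations - the jsonl layout, field names, and scoring are all assumptions for illustration, not any particular framework's API:

```python
import json

# keep the top-k highest-scoring old traces as few-shot guidance for a
# newer model. the eval_score / input / tool_calls / answer fields are
# invented for this sketch.
def load_exemplars(path: str, k: int = 3) -> list[dict]:
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    traces.sort(key=lambda t: t["eval_score"], reverse=True)
    return traces[:k]

def build_prompt(task: str, exemplars: list[dict]) -> str:
    # prepend the best old traces as worked examples for the new model
    shots = "\n\n".join(
        f"input: {t['input']}\nsteps: {t['tool_calls']}\nanswer: {t['answer']}"
        for t in exemplars
    )
    return f"{shots}\n\ninput: {task}\nsteps:"
```

point being, the traces outlive the model that generated them, even if you eventually want to regenerate them with a stronger one.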